pspatreg is a package that fits penalized spline (PS) semiparametric static and dynamic spatial autoregressive models via Restricted (or Residual) Maximum Likelihood (REML) and Maximum Likelihood (ML). This approach combines penalized regression spline methods (Paul HC Eilers, Marx, and Durbán 2015) with standard spatial autoregressive models (such as SAR, SEM and SDM). These types of models are thoroughly discussed in Mı́nguez, Basile, and Durbán (2020); see also Montero, Mı́nguez, and Durbán (2012), Basile et al. (2014), and Hoshino (2018).
These models are very flexible since they make it possible to include within the same specification: i) spatial autoregressive terms (i.e. spatial lags of dependent and independent variables as well as spatial error terms) to capture spatial interaction or network effects; ii) time lags of the dependent variable to capture persistence effects; iii) parametric and nonparametric (smooth) terms to identify nonlinear relationships between the response variable and the covariates; iv) a spatio-temporal trend, i.e. a smooth interaction between the spatial coordinates and the time trend, to capture site-specific nonlinear time trends.
The proposed method also allows the user to apply an ANOVA decomposition of the spatio-temporal trend into several components (spatial and temporal main effects, and second- and third-order interactions between them), which gives further insights into the dynamics of the data. Thus, we use the acronym PS-ANOVA-SAR (SEM, SDM, SLX) for the new data generating process (DGP) proposed. The use of nested B-spline bases for the interaction components of the spatio-temporal trend contributes to the efficiency of the fitting procedure without compromising the goodness of fit of the model. Finally, we also consider an extension of the PS-ANOVA-SAR (SEM, SDM, SLX) including the time lag of the dependent variable (dynamic spatial model) and/or a first-order time series autoregressive term process (AR1) in the noise to accommodate residual serial correlation.
In a very general form, the semiparametric spatial autoregressive model reads as:
\[ y_{it}=\alpha y_{it-1}+\rho \sum_{j=1}^N w_{ij,N} y_{jt} + \pi \sum_{j=1}^N w_{ij,N} y_{jt-1}+ \\ \sum_{k=1}^K \beta^*_k x^*_{k,it} + \sum_{k=1}^K \gamma^*_k \sum_{j=1}^N w_{ij,N} x^*_{k,jt}+ \sum_{\delta=1}^\Delta g_\delta(x_{\delta, it}) + \sum_{\delta=1}^\Delta h_\delta\left(\sum_{j=1}^N w_{ij,N} x_{\delta,jt}\right) + \\ \widetilde{ f}(s_{1i},s_{2i},\tau_t)+\epsilon_{it} \]
\[\epsilon_{it}=\theta \sum_{j=1}^N w_{ij,N}\epsilon_{jt}+\phi \epsilon_{it-1}+u_{it}\] \[u_{it} \sim i.i.d.(0,\sigma^2_u)\]
where \((y_{it},x^*_{1,it},...,x^*_{K,it},x_{1,it},...,,x_{\Delta,it})\) are data observed for a sample of spatial units \(i\) (\(i=1,...,N\)) in each time period \(t=1,...,T\). The terms \((s_{1i},s_{2i})\) indicate the spatial coordinates (latitude and longitude) of the spatial unit \(i\) (either a spatial point or the centroid of a spatial polygon), and \(W_{ij}\) is an element of a row-standardized spatial weights matrix. The terms included into the model are:
\(y_{it-1}\) = the time lag of \(y_{it}\);
\(\sum_{j=1}^N w_{ij,N} y_{it}\) = the contemporaneous spatial lag of \(y_{it}\);
\(\sum_{j=1}^N w_{ij,N} y_{it-1}\) = the time lag of the spatial lag of \(y_{it}\);
\(\alpha\), \(\rho\), \(\pi\), \(\beta^*_k\), and \(\gamma^*_k\) = fixed parameters;
\(\sum_{k=1}^K \beta^*_k x^*_{k,it}\) and \(\sum_{k=1}^K \gamma^*_k \sum_{j=1}^N w_{ij,N} x^*_{k,jt}\) = parametric linear terms of some covariates \(x_{k,it}\) and of their spatial lags;
\(g_\delta(.)\) and \(h_\gamma(.)\) = nonparametric smooth functions of other covariates and of their spatial lags (they can also accommodate varying coefficient terms, smooth interaction between covariates, factor-by-curve intercations, and so on);
\(\widetilde{ f}(s_{1i},s_{2i},\tau_t)\) = an unknown nonparametric spatio-temporal trend;
\(\epsilon_{it}\) = an idiosyncratic error term that can follow a spatial autoregressive process \(\epsilon_{it}=\theta \sum_{j=1}^N w_{ij,N}\epsilon_{it}+u_{it}\) with \(u_{it}\sim N(0,\sigma^2)\) (SEM), a time series autoregressive AR(1) process, i.e., \(\epsilon_{it}=\phi \epsilon_{it-1}+u_{it}\) with \(u_{it}\sim N(0,\sigma^2)\), or both.
Obviously, this very general specification is hardly suitable in real data applications. However, it is worth noticing that it nests most of the spatial additive models which can be used in practice. For example, a more suitable semiparametric static model reads as:
\[y_{it}=\rho \sum_{j=1}^N w_{ij,N} y_{jt} + \sum_{k=1}^K \beta^*_k x^*_{k,it} + \sum_{\delta=1}^\Delta g_\delta(x_{\delta, it}) + \widetilde{ f}(s_{1i},s_{2i},\tau_t)+\epsilon_{it}\]
\[\epsilon_{it}=\phi \epsilon_{it-1}+u_{it}\] \[u_{it} \sim i.i.d.(0,\sigma^2_u)\] This semiparametric SAR model turns out to be extremely useful to capture interactive spatial and temporal unobserved heterogeneity when the last one is smoothly distributed over space and time (Mı́nguez, Basile, and Durbán 2020). The dynamic extension (including \(y_{it-1}\) and \(\sum_{j=1}^N w_{ij,N} y_{it-1}\)) is also very promising and merits further theoretical investigation. Finally, the following semiparametric SAR model is very useful for modeling cross-setional spatial data taking into account nonlinearities, spatial dependence and spatial heterogeneity:
\[y_{i}=\rho \sum_{j=1}^N w_{ij,N} y_{j} + \sum_{k=1}^K \beta^*_k x^*_{k,i} + \sum_{\delta=1}^\Delta g_\delta(x_{\delta, i}) + \widetilde{ f}(s_{1i},s_{2i})+\epsilon_{i}\]
\[\epsilon_{i}\sim i.i.d.(0,\sigma^2_\epsilon)\] In many situations the spatio-temporal trend to be estimated can be complex, and the use of a single multidimensional smooth function may not be flexible enough to capture the structure in the data. To solve this problem, an ANOVA-type decomposition of \(\widetilde{ f}(s_{1i},s_{2i},\tau_t)\) can be used, where spatial and temporal main effects, and second- and third-order interactions between them can be identified:
\[\widetilde{ f}(s_{1i},s_{2i},\tau_t)=f_1(s_{1i})+f_2(s_{2i})+f_{\tau}(\tau_t)+ \\ f_{1,2}(s_{1i},s_{2i})+f_{1,\tau}(s_{1i},\tau_t)+f_{2,\tau}+(s_{2i},\tau_t)+f_{1,2,\tau}(s_{1i},s_{2i},\tau_t)\]
First, the geoadditive terms given by \(f_1(s_{1i}),f_2(s_{2i}),f_{1,2}(s_{1i},s_{2i})\) work as control functions to filter the spatial trend out of the residuals, and transfer it to the mean response in a model specification. Thus, they make it possible to capture the shape of the spatial distribution of \(y_{it}\), conditional on the determinants included in the model. These control functions also isolate stochastic spatial dependence in the residuals, that is, spatially autocorrelated unobserved heterogeneity. Thus, they can be regarded as an alternative to the use of individual regional dummies to capture unobserved heterogeneity, as long as such heterogeneity is smoothly distributed over space. Regional dummies peak at significantly higher and lower levels of the mean response variable. If these peaks are smoothly distributed over a two-dimensional surface (i.e., if unobserved heterogeneity is spatially autocorrelated), the smooth spatial trend is able to capture them.
Second, the smooth time trend, \(f_{\tau}(\tau_t)\), and the smooth interactions between space and time - \(f_{1,\tau}(s_{1i},\tau_t),f_{2,\tau},(s_{2i},\tau_t),f_{1,2,\tau}(s_{1i},s_{2i},\tau_t)\) - work as control functions to capture the heterogeneous effect of common shocks. Thus, conditional on a smooth distribution of the spatio-temporal heterogeneity, the PS-ANOVA-SAR (SDM, SEM, SLX) model works as an alternative to the models proposed by Bai and Li (2013), Shi and Lee (2018), Pesaran and Tosetti (2011), Bailey, Holly, and Pesaran (2016) and Vega and Elhorst (2016) based on extensions of common factor models to accommodate both strong cross-sectional dependence (through the estimation of the spatio-temporal trend) and weak cross-sectional dependence (through the estimation of spatial autoregressive parameters).
Furthermore, this framework is also flexible enough to control for the linear and nonlinear functional relationships between the dependent variable and the covariates as well as the heterogeneous effects of these regressors across space. The model inherits all the good properties of penalized regression splines, such as coping with missing observations by appropriately weighting them, and straightforward interpolation of the smooth functions.
All the non-parametric terms are modeled using Penalized-splines (P-Splines, P. H. Eilers and Marx (1996)). This methodology assumes that the unknown function to be estimated is smooth, and can be represented as a linear combination of basis functions,
\[f(x)=\sum_j \theta_j B_j(x),\] a popular election is the use of cubic B-spline basis (De Boor 1977). In the case of multidimensional functions, the basis is calculated as tensor products of the marginal basis functions in each dimension. The smoohness of the curve/surface is controlled by adding to the likelihood a penalty tunned by a smoothing parameter, \(\lambda\) (the number of smoothing parameters used is equal to the dimension of the function to be estimated). The penalty can take many forms, depending of the prior knowledge on the curve/surface. The most common choice is to use second order differences on adjacent B-spline coefficients (see Mı́nguez, Basile, and Durbán 2020). In the case of a univariate function, \(f(x)\), the penalty is:
\[ Penalty= \lambda\sum_j \left ( \theta_j-2\theta_{j-1}+\theta_{j-2}\right )^2.\]
As we mentioned above, the model is estimated via REML or ML. In order to be able to estimate simultaneously all parameters in the model (including the smothing parameters) we make use of the fact that a P-spline can be reparameterized as a mixed model, and so, all the methodology debeloped in this context can be use for estimation and inference. Furthermore, the mixed model setting allows us to impose the necessary constraints so that all terms in the model are identifiable, as well as, to add spatial (or other) random effects if they are necessary.
To speed up computations, we use a modification of the SOP (Separation of Anisotropic Penalties) algorithm derived by Rodriguez-Alvarez et al. (2015) (for variance components estimation). Also, the use of nested B-spline bases (Lee, Durban, and Eilers 2013) for the interaction components of the spatio-temporal trend contributes to the efficiency of the fitting procedure without compromising the goodness of fit of the model. See Mı́nguez, Basile, and Durbán (2020) for more details.
This package is available in GitHub (https://github.com/rominsal/pspatreg) and can be installed in the usual way.1
The package is accompanied by other two vignettes
B_Examples_pspatreg_CS_data.html and
C_Examples_pspatreg_Panel_data.html, introducing the
application of pspatreg
to cross-sectional and panel data.
Here, we introduce some basic general information about the package.
pspatfit()
The main function in the pspatreg
package is
pspatfit()
, which estimates spatio-temporal pernalized
spline spatial regression models using either the Restricted Maximum
Likelihood (REML) method or the Maximum Likelihood (ML) method. In its
generic form pspatfit()
appears as:
pspatfit(formula, data, na.action, listw = NULL,type = "sim", method = "eigen", Durbin = NULL, zero.policy = NULL,
interval = NULL,trs = NULL, cor = "none", dynamic = FALSE, control = list())
The function pspatfit()
returns a list of quantities of
class pspat
, including coefficients of the parametric terms
and their standard errors, estimated coefficients corresponding to
random effects in mixed model and their standard errors, equivalent
degrees of freedom, residuals, fitted values, etc. A wide range of
standard methods is also available for the pspat
objects,
including print()
, summary()
,
coef()
, vcov()
, anova()
,
fitted()
, residuals()
, and
plot()
.
formula
The argument formula
within the function
pspatfit()
is formula similar to GAM specification
including parametric and non-parametric terms. Parametric covariates are
included in the usual way and non-parametric p-spline smooth terms are
specified using pspl(.)
and pspat(.)
for the
non-parametric covariates and spatial or spatio-temporal trends,
respectively. For example
formula <- y ~ x1 + x2 + pspl(x3, nknots = 15) + pspl(x4, nknots = 20) +
pspt(long, lat, year, nknots = c(18,18,8),
psanova = TRUE,
nest_sp1 = c(1, 2, 3),
nest_sp2 = c(1, 2, 3),
nest_time = c(1, 2, 2))
In the example above, the model includes two parametric terms, two
nonparametric terms, and a spatio-temporal trend (with long and lat as
spatial coordinates, and year as temporal coordinate). The dimension of
the basis function both in pspl(.)
and pspt(.)
is defined by nknots
. This term should not be less than the
dimension of the null space of the penalty for the term (see
null.space.dimension
and choose.k
from package
mgcv
to know how to choose nknots
). The
default number of nknots
in pspl(.)
is 10, but
in this example we have chosen 15 nknots
for
g_1(x_3)
and 20 nknots
for
g_2(x_4)
. The default number of nknots
in
pspt(.)
is c(10,10,5)
, but we have chosen
c(18,18,8)
.
In this example we also adopt an ANOVA decomposition of the
spatio-temporal trend (choosing psanova = TRUE
). Each
effect has its own degree of smoothing allowing a greater flexibility
for the spatio-temporal trend. Calculating up to third-order
interactions can be computationally expensive. To address this problem,
we can select subgroups of interaction effects for the second- and
third-order effects. To define these subgroups, we use three parameters
available in pspt()
: nest_sp1
,
nest_sp2
, and nest_time
. These parameters
indicate the divisors of the nknots
parameters. For
example, if we set nest_sp1 = c(1,2,3)
, we will have all
knots for the s_1
effect, 18/2 for each second-order
effects with s_1
, and 18/3 nots for the third order effect
with s_1
.2
If we want to set any main effect to 0
, we must set the
parameters f1_main
, f2_main
or
ft_main
to FALSE
, The default is
TRUE
. We can also exclude second- or third-order effects
setting f12_int
, f1t_int
,
f2t_int
, f12t_int
to FALSE
.
Type
Using the argument Type
we can choose different spatial
model specifications: "sar"
, "sem"
,
"sdm"
, "sdem"
, "sarar"
, or
"slx"
. When creating a "slx"
,
"sdem"
or "sdm"
model, we need to include the
formula of the durbin part in the Durbin parameter.
The argument data
must contain all the variables
included in parametric and non-parametric terms of the model. If a
pspat(.)
term is included in formula
, the data
must contain the spatial and temporal coordinates specified in
pspat(.)
. In this case, the coordinates must be ordered
choosing time as fast index and spatial coordinates as slow indexes.
Both data.frame
and sf
class objects can be
used as data
inputs.3 sf
objects are recommended
since they allow the user to map spatial trends. In our demos we use two
datasets in sf
version.
Plotting the estimated non-parametric smooth terms represents an
important step in semiparametric regression analyses. First, the
function fit_terms()
computes estimated non-parametric
smooth terms. Then, the functions plot_sp2d()
and
plot_sp3d()
are used to plot and map spatial and
spatio-temporal trends, respectively, while plot_sptime()
is used to plot the time trend for PS-ANOVA models in 3d; finally, and
plot_terms()
is used to plot smooth non-parametric
terms.
In the case of a semiparametric model without the spatial lag of the
dependent variable (PS model), if all regressors are manipulated
independently of the errors, \(\widehat{g}_\delta(x_{\delta, it})\) can be
interpreted as the conditional expectation of \(y\) given \(x_{\delta}\) (net of the effect of the
other regressors). @blu.pow.03 use the term Average Structural Function
(ASF) with reference to these functions. Instead, in PS-SAR, PS-SDM or
in PS-SARAR model, when \(\rho\) is
different from zero, the estimated smooth functions cannot be
interpreted as ASF. Taking advantage of the results obtained for
parametric SAR, we can compute the total smooth effect (total–ASF) of
\(x_{\delta}\) as:
\[\widehat{g}_{\delta}^{T}\left(x_{\delta}\right)=\Sigma_{q}
\left[\textbf{I}_{n}-\widehat{\rho}\textbf{W}_{n}\right]^{-1}_{ij}
b_{\delta q}(x_{\delta})\widehat{\beta}_{\delta q}\]
where \(b_{\delta q}(x_{\delta})\) are P-spline basis functions, and \(\widehat{\beta}_{\delta q}\) the corresponding estimated parameters.
We can also compute direct and indirect (or spillover) impacts of smooth terms in the PS-SAR case as:
\[\widehat{g}_{\delta}^{D}\left(x_{\delta}\right)=\Sigma_{q} \left[\textbf{I}_{n}-\widehat{\rho}\textbf{W}_{n}\right]^{-1}_{ii} b_{\delta q}(x_{k})\widehat{\beta}_{\delta q}\]
\[\widehat{g}_{\delta}^{I}\left(x_{\delta}\right)=\widehat{g}_{\delta}^{T}\left(x_{\delta}\right)-\widehat{g}_{\delta}^{D}\left(x_{\delta}\right)\]
Similar expressions can be provided for the direct, indirect and total impacts of the PS-SDM.
The function impactspar()
computes direct, indirect and
total impacts for continuous parametric covariates using the standard
procedure for their computation (LeSage and Pace
2009).
The function impactsnopar()
computes direct, indirect
and total impacts functions for continuous non-parametric covariates,
while the function plot_impactsnopar()
is used to plot
these impacts functions. It is worth noticing that total, direct and
indirect impacts are never smooth over the domain of the variable \(x_\delta\) due to the presence of the
spatial multiplier matrix in the algorithm for their computation.
Indeed, a wiggly profile of direct, indirect and total impacts would
appear even if the model were linear. Therefore, in the spirit of the
semiparametric approach, we included the possibility of applying a
spline smoother to obtain smooth curves (using the argument
smooth=TRUE
in the function
plot_impactsnopar()
).
To install any R package from GitHub you need to have
previously installed devtools
package from CRAN. Then
execute the commands library(devtools)
, to load
devtools
, and install
github("rominsal/pspatreg")
to install
pspatreg
package:
devtools::install_github("rominsal/pspatreg")
.↩︎
In most empirical cases, the main effects are more flexible than interaction effects and therefore the number of knots in B-Spline bases for interaction effects do not need to be as large as the number of knots for the main effects (Lee, Durban, and Eilers 2013).↩︎
sf
means simple features of spatial vector
objects. The geographic vector data model is based on points located
within a coordinate reference system (CRS). Points can represent
self-standing features (e.g., the location of a house) or they can be
linked together to form more complex geometries such as lines and
polygons. Most point geometries contain only two dimensions
x
and y
(3-dimensional CRSs contain an
additional z
value, typically representing height above sea
level). sf
objects provide both a geometry
information, describing where on Earth the feature is located, and
attributes information, describing other properties (like the
population of the region, the unemployment rate, etc.).
data.frame
objects store only attributes information.↩︎