HiClimR
:
Hierarchical Climate
Regionalization
HiClimR
is a tool for Hierarchical Climate
Regionalization applicable to any correlation-based
clustering. Climate regionalization is the process of dividing an area
into smaller regions that are homogeneous with respect to a specified
climatic metric. Several features are added to facilitate the
applications of climate regionalization (or spatiotemporal analysis in
general) and to implement a cluster validation function with an
objective tree cutting to find an optimal number of clusters for a
user-specified confidence level. These include options for preprocessing
and postprocessing as well as efficient code execution for large
datasets and options for splitting big data and computing only the
upper-triangular half of the correlation/dissimilarity matrix to
overcome memory limitations. Hybrid hierarchical clustering reconstructs
the upper part of the tree above a cut to get the best of the available
methods. Multivariate clustering (MVC) provides options for filtering
all variables before preprocessing, detrending and standardization of
each variable, and applying weights for the preprocessed variables.
HiClimR
adds several features and a new clustering method (called,
regional
linkage) to hierarchical clustering in R
(hclust
function in stats
library)
including:
BLAS
library on 64-Bit machines
ATLAS
OpenBLAS
Intel MKL
regional
linkage or minimum inter-regional
correlationward
’s minimum variance or error sum of squares
methodsingle
linkage or nearest neighbor methodcomplete
linkage or diameteraverage
linkage, group average, or UPGMA methodmcquitty
’s or WPGMA methodmedian
, Gower’s or WPGMC methodcentroid
or UPGMC methodregional
linkage
methodThe regional
linkage method is explained in the context
of a spatiotemporal problem, in which N
spatial elements
(e.g., weather stations) are divided into k
regions, given
that each element has a time series of length M
. It is
based on inter-regional correlation distance between the temporal means
of different regions (or elements at the first merging step). It
modifies the update formulae of average
linkage method by
incorporating the standard deviation of the merged region timeseries,
which is a function of the correlation between the individual regions,
and their standard deviations before merging. It is equal to the average
of their standard deviations if and only if the correlation between the
two merged regions is 100%
. In this special case, the
regional
linkage method is reduced to the classic
average
linkage clustering method.
Badr et
al. (2015) describes the regionalization algorithms, features, and
data processing tools included in the package and presents a
demonstration application in which the package is used to regionalize
Africa on the basis of interannual precipitation variability. The figure
below shows a detailed flowchart for the package. Cyan
blocks represent helper functions, green
is input data or
parameters, yellow
indicates agglomeration Fortran code,
and purple
shows graphics options. For multivariate
clustering (MVC), the input data is a list of matrices (one matrix for
each variable with the same number of rows to be clustered; the number
of columns may vary per variable). The blue dashed boxes involve a loop
for all variables to apply mean and/or variance thresholds, detrending,
and/or standardization per variable before weighing the preprocessed
variables and binding them by columns in one matrix for clustering.
x
is the input N x M
data matrix,
xc
is the coarsened N0 x M
data matrix where
N0 ≤ N
(N0 = N
only if
lonStep = 1
and latStep = 1
), xm
is the masked and filtered N1 x M1
data matrix where
N1 ≤ N0
(N1 = N0
only if the number of masked
stations/points is zero) and M1 ≤ M
(M1 = M
only if no columns are removed due to missing values), and
x1
is the reconstructed N1
x M1
data matrix if PCA is performed.
HiClimR
is applicable to any correlation-based clustering.
There are many ways to install an R package from precompiled binaries
or source code. For more details, you may search for how to install an R
package, but here are the most convenient ways to install HiClimR
:
This is the easiest way to install an R package on Windows, Mac, or Linux. You just fire up an R shell and type:
install.packages("HiClimR")
In theory the package should just install, however, you may be asked
to select your local mirror (i.e. which server should you use to
download the package). If you are using R-GUI or
R-Studio, you can find a menu for package installation
where you can just search for HiClimR
and install it.
This is intended for developers and requires a development
environment (compilers, libraries, … etc) to install the latest
development release of HiClimR
.
On Linux and Mac, you can download the
source code and use R CMD INSTALL
to install it. In a
convenient way, you may use pak
as follows:
pak
from CRAN:install.packages("pak")
Make sure you have a working development environment:
Rtools
.Install HiClimR
from GitHub
source:
::pkg_install("hsbadr/HiClimR") pak
The source code repository can be found on GitHub at hsbadr/HiClimR.
HiClimR
is licensed under GPL v3
. The
code is modified by Hamada S.
Badr from src/library/stats/R/hclust.R
part of R package
Copyright © 1995-2021 The R Core Team.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
A copy of the GNU General Public License is available at https://www.r-project.org/Licenses.
Copyright © 2013-2021 Earth and Planetary Sciences (EPS), Johns Hopkins University (JHU).
To cite HiClimR in publications, please use:
citation("HiClimR")
Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2015): A Tool for Hierarchical Climate Regionalization, Earth Science Informatics, 8(4), 949-958, https://doi.org/10.1007/s12145-015-0221-7.
Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2014): HiClimR: Hierarchical Climate Regionalization, Comprehensive R Archive Network (CRAN), https://cran.r-project.org/package=HiClimR.
Version | Date | Comment | Author | |
---|---|---|---|---|
May 1992 | Original | F. Murtagh | ||
Dec 1996 | Modified | Ross Ihaka | ||
Apr 1998 | Modified | F. Leisch | ||
Jun 2000 | Modified | F. Leisch | ||
1.0.0 | 03/07/14 | HiClimR | Hamada S. Badr | badr@jhu.edu |
1.0.1 | 03/08/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.0.2 | 03/09/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.0.3 | 03/12/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.0.4 | 03/14/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.0.5 | 03/18/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.0.6 | 03/25/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.0.7 | 03/30/14 | Hybrid | Hamada S. Badr | badr@jhu.edu |
1.0.8 | 05/06/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.0.9 | 05/07/14 | CRAN | Hamada S. Badr | badr@jhu.edu |
1.1.0 | 05/15/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.1.1 | 07/14/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.1.2 | 07/26/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.1.3 | 08/28/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.1.4 | 09/01/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.1.5 | 11/12/14 | Updated | Hamada S. Badr | badr@jhu.edu |
1.1.6 | 03/01/15 | GitHub | Hamada S. Badr | badr@jhu.edu |
1.2.0 | 03/27/15 | MVC | Hamada S. Badr | badr@jhu.edu |
1.2.1 | 05/24/15 | Updated | Hamada S. Badr | badr@jhu.edu |
1.2.2 | 07/21/15 | Updated | Hamada S. Badr | badr@jhu.edu |
1.2.3 | 08/05/15 | Updated | Hamada S. Badr | badr@jhu.edu |
2.0.0 | 12/22/18 | NOTE | Hamada S. Badr | badr@jhu.edu |
2.1.0 | 01/01/19 | NetCDF | Hamada S. Badr | badr@jhu.edu |
2.1.1 | 01/02/19 | Updated | Hamada S. Badr | badr@jhu.edu |
2.1.2 | 01/04/19 | Updated | Hamada S. Badr | badr@jhu.edu |
2.1.3 | 01/10/19 | Updated | Hamada S. Badr | badr@jhu.edu |
2.1.4 | 01/20/19 | Updated | Hamada S. Badr | badr@jhu.edu |
2.1.5 | 12/10/19 | inherits | Hamada S. Badr | badr@jhu.edu |
2.1.6 | 02/22/20 | Updated | Hamada S. Badr | badr@jhu.edu |
2.1.7 | 11/05/20 | Updated | Hamada S. Badr | badr@jhu.edu |
2.1.8 | 01/04/21 | Updated | Hamada S. Badr | badr@jhu.edu |
library(HiClimR)
#----------------------------------------------------------------------------------#
# Typical use of HiClimR for single-variate clustering: #
#----------------------------------------------------------------------------------#
## Load the test data included/loaded in the package (1 degree resolution)
<- TestCase$x
x <- TestCase$lon
lon <- TestCase$lat
lat
## Generate/check longitude and latitude mesh vectors for gridded data
<- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
xGrid <- c(xGrid$lon)
lon <- c(xGrid$lat)
lat
## Single-Variate Hierarchical Climate Regionalization
<- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
y continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
#----------------------------------------------------------------------------------#
# Additional Examples: #
#----------------------------------------------------------------------------------#
## Use Ward's method
<- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
y continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 5, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## Use data splitting for big data
<- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
y continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = TRUE, kH = NULL,
members = NULL, nSplit = 10, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## Use hybrid Ward-Regional method
<- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
y continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = TRUE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## Check senitivity to kH for the hybrid method above
#----------------------------------------------------------------------------------#
# Typical use of HiClimR for multivariate clustering: #
#----------------------------------------------------------------------------------#
## Load the test data included/loaded in the package (1 degree resolution)
<- TestCase$x
x1 <- TestCase$lon
lon <- TestCase$lat
lat
## Generate/check longitude and latitude mesh vectors for gridded data
<- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
xGrid <- c(xGrid$lon)
lon <- c(xGrid$lat)
lat
## Test if we can replicate single-variate region map with repeated variable
<- HiClimR(x=list(x1, x1), lon = lon, lat = lat, lonStep = 1, latStep = 1,
y geogMask = FALSE, continent = "Africa", meanThresh = list(10, 10),
varThresh = list(0, 0), detrend = list(TRUE, TRUE), standardize = list(TRUE, TRUE),
nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## Generate a random matrix with the same number of rows
<- matrix(rnorm(nrow(x1) * 100, mean=0, sd=1), nrow(x1), 100)
x2
## Multivariate Hierarchical Climate Regionalization
<- HiClimR(x=list(x1, x2), lon = lon, lat = lat, lonStep = 1, latStep = 1,
y geogMask = FALSE, continent = "Africa", meanThresh = list(10, NULL),
varThresh = list(0, 0), detrend = list(TRUE, FALSE), standardize = list(TRUE, TRUE),
weightMVC = list(1, 1), nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## You can apply all clustering methods and options
#----------------------------------------------------------------------------------#
# Miscellaneous examples to provide more information about functionality and usage #
# of the helper functions that can be used separately or for other applications. #
#----------------------------------------------------------------------------------#
## Load test case data
<- TestCase$x
x
## Generate longitude and latitude mesh vectors
<- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
xGrid <- c(xGrid$lon)
lon <- c(xGrid$lat)
lat
## Coarsening spatial resolution
<- coarseR(x = x, lon = lon, lat = lat, lonStep = 2, latStep = 2)
xc <- xc$lon
lon <- xc$lat
lat <- xc$x
x
## Use fastCor function to compute the correlation matrix
<- proc.time(); xcor <- fastCor(t(x)); proc.time() - t0
t0 ## compare with cor function
<- proc.time(); xcor0 <- cor(t(x)); proc.time() - t0
t0
## Check the valid options for geographic masking
geogMask()
## geographic mask for Africa
<- geogMask(continent = "Africa", lon = lon, lat = lat, plot = TRUE,
gMask colPalette = NULL)
## Hierarchical Climate Regionalization Without geographic masking
<- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
y continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## With geographic masking (you may specify the mask produced above to save time)
<- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = TRUE,
y continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## With geographic masking and contiguity constraint
## Change contigConst as appropriate
<- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = TRUE,
y continent = "Africa", contigConst = 1, meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## Find minimum significant correlation at 95% confidence level
<- minSigCor(n = nrow(x), alpha = 0.05, r = seq(0, 1, by = 1e-06))
rMin
## Validtion of Hierarchical Climate Regionalization
<- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE, colPalette = NULL)
z
## Apply minimum cluster size (minSize = 25)
<- validClimR(y, k = 12, minSize = 25, alpha = 0.01, plot = TRUE, colPalette = NULL)
z
## The optimal number of clusters, including small clusters
<- length(z$clustFlag)
k
## The selected number of clusters, after excluding small clusters (if minSize > 1)
<- sum(z$clustFlag)
ks
## Dendrogram plot
plot(y, hang = -1, labels = FALSE)
## Tree cut
<- cutree(y, k = k)
cutTree table(cutTree)
## Visualization for gridded data
<- matrix(y$region, nrow = length(unique(y$coords[, 1])), byrow = TRUE)
RegionsMap <- colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan",
colPalette "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
image(unique(y$coords[, 1]), unique(y$coords[, 2]), RegionsMap, col = colPalette(ks))
## Visualization for gridded or ungridded data
plot(y$coords[, 1], y$coords[, 2], col = colPalette(max(y$region, na.rm = TRUE))[y$region], pch = 15, cex = 1)
## Change pch and cex as appropriate!
## Export region map and mean timeseries into NetCDF-4 file
library(ncdf4)
<- HiClimR2nc(y=y, ncfile="HiClimR.nc", timeunit="years", dataunit="mm")
y.nc ## The NetCDF-4 file is still open to add other variables or close it
nc_close(y.nc)