The PPforest
package (projection pursuit random forest)
contains functions to run a projection pursuit random forest for
classification problems. This method utilize combinations of variables
in each tree construction. In a random forest each split is based on a
single variable, chosen from a subset of predictors. In the
PPforest
, each split is based on a linear combination of
randomly chosen variables. The linear combination is computed by
optimizing a projection pursuit index, to get a projection of the
variables that best separates the classes. The PPforest
uses the PPtree
algorithm, which fits a single tree to the
data. Utilizing linear combinations of variables to separate classes
takes the correlation between variables into account, and can outperform
the basic forest when separations between groups occurs on combinations
of variables. Two projection pursuit indexes, LDA and PDA, are used for
PPforest
.
To improve the speed performance PPforest
package,
PPtree
algorithm was translated to Rcpp.
PPforest
package utilizes a number of R packages some of
them included in “suggests” not to load them all at package
start-up.
You can install the package from CRAN:
install.package(PPforest)
library(PPforest)
Or the development version of PPforest
can be installed
from github using:
library(devtools)
install_github("natydasilva/PPforest")
library(PPforest)
##Projection pursuit classification forest
In PPforest
, projection pursuit classification trees are
used as the individual model to be combined in the forest. The original
algorithm is in PPtreeViz
package, we translate the
original tree algorithm into Rcpp
to improve the speed
performance to run the forest.
One important characteristic of PPtree is that treats the data always as a two-class system, when the classes are more than two the algorithm uses a two step projection pursuits optimization in every node split. Let \((X_i,y_i)\) the data set, \(X_i\) is a p-dimensional vector of explanatory variables and \(y_i\in {1,2,\ldots G}\) represents class information with \(i=1,\ldots n\).
In the first step optimize a projection pursuit index to find an optimal one-dimension projection \(\alpha^*\) for separating all classes in the current data. With the projected data redefine the problem in a two class problem by comparing means, and assign a new label \(G1\) or \(G2\) to each observation, a new variable \(y_i^*\) is created. The new groups \(G1\) and \(G2\) can contain more than one original classes. Next step is to find an optimal one-dimensional projection \(\alpha\), using \((X_i,y_i^*)\) to separate the two class problem \(G1\) and \(G2\). The best separation of \(G1\) and \(G2\) is determine in this step and the decision rule is defined for the current node, if \(\sum_{i=1}^p \alpha_i M1< c\) then assign \(G1\) to the left node else assign \(G2\) to the right node, where \(M1\) is the mean of \(G1\). For each groups we can repeat all the previous steps until \(G1\) and \(G2\) have only one class from the original classes. Base on this process to grow the tree, the depth of PPtree is at most the number of classes because one class is assigned only to one final node.
Trees from PPtree
algorithm are simple, they use the
association between variables to find separation. If a linear boundary
exists, PPtree
produces a tree without
misclassification.
Projection pursuit random forest algorithm description
Let N the number of cases in the training set \(\Theta=(X,Y)\), \(B\) bootstrap samples from the training set are taking (samples of size N with replacement).
For each bootstrap sample a \verb PPtree is grown to the largest extent possible \(h(x, {\Theta_k})\). No pruning. This tree is grown using step 3 modification.
Let M the number of input variables, a number of \(m<<M\) variables are selected at random at each node and the best split based on a linear combination of these randomly chosen variables. The linear combination is computed by optimizing a projection pursuit index, to get a projection of the variables that best separates the classes.
Predict the classes of each case not included in the bootstrap sample and compute oob error.
Based on majority vote predict the class for new data.
###Overview PPforest package
PPforest
package implements a classification random
forest using projection pursuit classification trees. The following
table present all the functions in PPforest
package.
Function | Description |
---|---|
baggtree | For each bootstrap sample grow a projection pursuit tree (PPtree object). |
node_data | Data structure with the projected and boundary by node and class |
permute_importance | Obtain the permuted importance variable measure |
ppf_avg_imp | Computes a global importance measure for a PPforest object, average importance measure for a pptree over all the trees. |
PPclassify | Predict class for the test set and calculate prediction error after finding the PPtree structure |
ppf_global_imp | Computes a global importance measure for a PPforest object |
PPforest | Runs a Projection pursuit random forest |
PPtree_split | Projection pursuit classification tree with random variable selection in each split |
print.PPforest | Print PPforest object |
predict.PPforest | Predict class for the test set and calculate prediction error |
ternary_str | Data structure with the projected and boundary by node and class |
tree_pred | Obtain predicted class for new data from baggtree function or PPforest |
Also PPforest
package includes some data set that were
used to test the predictive performance of our method. The data sets
included are: crab, fishcatch, glass, image, leukemia, lymphoma NCI60,
parkinson and wine.
Australian crab data set will be used as example. This data contains measurements on rock crabs of the genus Leptograpsus. There are 200 observations from two species (blue and orange) and for each specie (50 in each one) there are 50 males and 50 females. Class variable has 4 classes with the combinations of specie and sex (BlueMale, BlueFemale, OrangeMale and OrangeFemale). The data were collected on site at Fremantle, Western Australia. For each specimen, five measurements were made, using vernier calipers.
To visualize this data set we use a scatterplot matrix from the
package GGally
In this figure we can see a strong, positive and linear association between the different variables. Also look like the classes can be separated by linear combinations.
The main function of the package is PPforest
which
implements a projection pursuit random forest.
PPtree_split
this function implements a projection
pursuit classification tree with random variable selection in each
split, based on the original PPtreeViz algorithm. This function returns
a PPtreeclass
object. To use this function we need to
specify a formula describing the model to be fitted response~predictors
(form
), data
is a data frame with the complete
data set. Also we need to specify the method PPmethod
, it
is the index to use for projection pursuit: ‘LDA’ or ‘PDA’,
size.p
is the proportion of variables randomly sampled in
each split. If size.p = 1 a classic PPtreeclass
object will
be fitted using all the variables in each node partition instead of a
subset of them. lambda
penalty parameter in PDA index and
is between 0 to 1 . he following example fits a projection pursuit
classification tree constructed using 0.6 of the variables (3 out of 5)
in each node split. We selected LDA
method.
<- PPforest::PPtree_split("Type~.", data = crab, PPmethod = "LDA", size.p = 0.6)
Tree.crab Tree.crab
## =============================================================
## Projection Pursuit Classification Tree result
## =============================================================
##
## 1) root
## 2) proj1*X < cut1
## 4)* proj2*X < cut2 -> "2"
## 5)* proj2*X >= cut2 -> "1"
## 3) proj1*X >= cut1
## 6)* proj3*X < cut3 -> "4"
## 7)* proj3*X >= cut3 -> "3"
##
## Error rates of various cutoff values
## -------------------------------------------------------------
## Rule1 Rule2 Rule3 Rule4 Rule5 Rule6 Rule7 Rule8
## error.rate 0.065 0.065 0.065 0.065 0.075 0.075 0.075 0.075
PPforest
function runs a projection pursuit random
forest. The arguments are a data frame with the data information, class
with the name of the class variable argument. size.tr to specify the
proportion of observations using in the training. Using this function we
have the option to split the data in training and test using size.tr
directly. size.tr
is the proportion of data used in the
training and the test proportion will be 1- size.tr
. The
number of trees in the forest is specified using the argument
m
. The argument size.p is the sample proportion of the
variables used in each node split, PPmethod
is the
projection pursuit index to be optimized, two options LDA and PDA are
available.
<- PPforest::PPforest(data = crab, class = "Type", size.tr = .8, m = 200,
pprf.crab size.p = .5, PPmethod = 'LDA', parallel =FALSE, cores = 2)
pprf.crab
##
## Call:
## PPforest::PPforest(data = crab, class = "Type", size.tr = 0.8, m = 200, PPmethod = "LDA", size.p = 0.5, parallel = FALSE, cores = 2)
## Type of random forest: Classification
## Number of trees: 200
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.37%
## Confusion matrix:
## BlueFemale BlueMale OrangeFemale OrangeMale class.error
## BlueFemale 38 2 0 0 0.05
## BlueMale 3 37 0 0 0.07
## OrangeFemale 0 0 39 1 0.03
## OrangeMale 0 1 0 39 0.03
PPforest
print a summary result from the model with the
confusion matrix information and the oob-error rate in a similar way
randomForest packages does.
This function returns the predicted values of the training data, training error, test error and predicted test values. Also there is the information about out of bag error for the forest and also for each tree in the forest. Bootstrap samples, output of all the trees in the forest from , proximity matrix and vote matrix, number of trees grown in the forest, number of predictor variables selected to use for splitting at each node. Confusion matrix of the prediction (based on OOb data), the training data and test data and vote matrix are also returned.
The printed version of a PPforest
object follows the
randomForest
printed version to make them comparable. Based
on confusion matrix, we can observe that the biggest error is for
BlueMale class. Most of the wrong classified values are between
BlueFemale and BlueMale.
The output from a PPforest
object contains a lot of
information as we can see in the next output.
str(pprf.crab, max.level = 1 )
## List of 21
## $ predicting.training: Factor w/ 4 levels "BlueFemale","BlueMale",..: 2 2 2 2 1 2 2 2 1 2 ...
## $ training.error : num 0.0375
## $ prediction.test : Factor w/ 4 levels "BlueFemale","BlueMale",..: 2 2 1 2 2 2 2 2 2 2 ...
## $ error.test : num 0.15
## $ oob.error.forest : num 0.0437
## $ oob.error.tree : num [1:200, 1] 0.1695 0.2679 0.0317 0.1207 0.3273 ...
## $ boot.samp :List of 200
## $ output.trees :List of 200
## $ proximity : num [1:160, 1:160] 0 0.8 0.87 0.815 0.39 0.9 0.78 0.51 0.395 0.705 ...
## $ votes : num [1:160, 1:4] 0.222 0.333 0.191 0.222 0.661 ...
## ..- attr(*, "dimnames")=List of 2
## $ prediction.oob : Factor w/ 4 levels "BlueFemale","BlueMale",..: 2 2 2 2 1 2 2 2 1 2 ...
## $ n.tree : num 200
## $ n.var : num 2
## $ type : chr "Classification"
## $ confusion : num [1:4, 1:5] 38 3 0 0 2 37 0 1 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## $ call : language PPforest::PPforest(data = crab, class = "Type", size.tr = 0.8, m = 200, PPmethod = "LDA", size.p = 0.5, para| __truncated__
## $ train :'data.frame': 160 obs. of 6 variables:
## $ test :'data.frame': 40 obs. of 5 variables:
## $ vote.mat : num [1:200, 1:160] 1 1 2 2 1 1 2 4 4 2 ...
## ..- attr(*, "dimnames")=List of 2
## $ class.var : chr "Type"
## $ oob.obs : num [1:200, 1:160] 0 1 1 0 0 0 0 1 1 1 ...
## - attr(*, "class")= chr "PPforest"
For example to get the predicted values for the test data we can use the PPforest output:
$prediction.test pprf.crab
## [1] BlueMale BlueMale BlueFemale BlueMale BlueMale
## [6] BlueMale BlueMale BlueMale BlueMale BlueMale
## [11] BlueMale BlueFemale BlueFemale BlueFemale BlueFemale
## [16] BlueFemale BlueFemale BlueFemale BlueFemale BlueFemale
## [21] OrangeMale OrangeMale OrangeMale OrangeMale OrangeMale
## [26] OrangeMale OrangeMale OrangeMale OrangeMale OrangeMale
## [31] OrangeMale OrangeMale OrangeMale OrangeMale OrangeFemale
## [36] OrangeFemale OrangeFemale OrangeFemale OrangeFemale OrangeFemale
## Levels: BlueFemale BlueMale OrangeFemale OrangeMale
If new data are available you can use the function
trees_pred
to get the predicted classes by PPforest
object.
trees_pred(pprf.crab, xnew = newdata, parallel = TRUE)
The PPforest algorithm calculates variable importance in two ways: (1) permuted importance using accuracy, and (2) importance based on projection coefficients on standardized variables.
The permuted variable importance is comparable with the measure defined in the classical random forest algorithm. It is computed using the out of bag (oob) sample for the tree \(k\;\;(B^{(k)})\) for each \(X_j\) predictor variable. Then the permuted importance of the variable \(X_j\) in the tree \(k\) can be defined as:
\[ IMP^{(k)}(X_j) = \frac{\sum_{i \in B^{(k)} } I(y_i=\hat y_i^{(k)})-I(y_i=\hat y_{i,P_j}^{(k)})}{|B^{(k)}|} \]
where \(\hat y_i^{(k)}\) is the predicted class for the observation \(i\) in the tree \(k\) and \(y_{i,P_j}^{(k)}\) is the predicted class for the observation \(i\) in the tree \(k\) after permuting the values for variable \(X_j\). The global permuted importance measure is the average importance over all the trees in the forest. This measure is based on comparing the accuracy of classifying out-of-bag observations, using the true class with permuted (nonsense) class. To compute this measure you should use permute_importance function.
<- permute_importance(pprf.crab)
impo1 impo1
## nm imp sd.imp imp2 sd.imp2 imp2.std imp.std
## 1 CW 17.975 9.342418 0.3104134 0.1596670 1.944129 1.924020
## 2 BD 18.470 10.632703 0.3199451 0.1844948 1.734169 1.737094
## 3 CL 20.505 10.005023 0.3559586 0.1728535 2.059308 2.049471
## 4 FL 20.805 11.036576 0.3594990 0.1874475 1.917865 1.885096
## 5 RW 21.195 9.461213 0.3668738 0.1613988 2.273088 2.240199
This function returns a data frame with permuted importance measures, imp is the permuted importance measure defined in Brieman paper, imp2 is the permuted importance measure defined in randomForest package, the standard deviation (sd.im and sd.imp2) for each measure is computed and the also the standardized measure.
For the second importance measure, the coefficients of each projection are examined. The magnitude of these values indicates importance, if the variables have been standardized. The variable importance for a single tree is computed by a weighted sum of the absolute values of the coefficients across nodes. The weights takes the number of classes in each node into account (Lee et al. 2013). Then the importance of the variable \(X_j\) in the PPtree \(k\) can be defined as:
\[ IMP_{pptree}^{(k)}(X_j)=\sum_{nd = 1}^{nn}\frac{|\alpha_{nd}^{(k)}|}{cl_{nd} } \]
Where \(\alpha_{nd}^{(k)}\) is the projected coefficient for node \(ns\) and variable \(k\) and \(nn\) the total number of node partitions in the tree \(k\).
The global variable importance in a PPforest then can be defined in different ways. The most intuitive is the average variable importance from each PPtree across all the trees in the forest.
\[ IMP_{ppforest1}(X_j)=\frac{\sum_{k=1}^K IMP_{pptree}^{(k)}(X_j)}{K} \]
Alternatively we have defined a global importance measure for the forest as a weighted mean of the absolute value of the projection coefficients across all nodes in every tree. The weights are based on the projection pursuit indexes in each node (\(Ix_{nd}\)), and 1-(OOB-error of each tree)(\(acc_k\)).
\[IMP_{ppforest2}(X_j)=\frac{\sum_{k=1}^K acc_k \sum_{nd = 1}^{nn}\frac{Ix_{nd}|\alpha_{nd}^{(k)}|}{nn }}{K} \]
<- ppf_avg_imp(pprf.crab, "Type")
impo2 impo2
## # A tibble: 5 × 2
## variable mean
## <fct> <dbl>
## 1 CL 0.462
## 2 CW 0.452
## 3 RW 0.385
## 4 BD 0.313
## 5 FL 0.279
Finally you can get the last importance measure we have proposed for the PPforest using `ppf_global_imp’ function.
<- ppf_global_imp(data = crab, class = "Type", pprf.crab)
impo3 impo3
## # A tibble: 5 × 2
## variable mean
## <fct> <dbl>
## 1 CW 0.419
## 2 CL 0.384
## 3 RW 0.335
## 4 BD 0.281
## 5 FL 0.236
Using the information available in the PPforest object, some visualization can be done. I will include some useful examples to visualize the data and some of the most important diagnostics in a forest structure.
To describe the data structure a parallel plot can be done, the data were standardized and the color represents the class variable.
ternary_str
is an auxiliary functions in
PPforest
to get the data structure needed to do a ternary
plot or a generalized ternary plot if more than 3 classes are available.
Because the PPforest is composed of many tree fits on subsets of the
data, a lot of statistics can be calculated to analyze as a separate
data set, and better understand how the model is working. Some of the
diagnostics of interest are: variable importance, OOB error rate, vote
matrix and proximity matrix.
With a decision tree we can compute for every pair of observations the proximity matrix. This is a \(nxn\) matrix where if two cases \(k_i\) and \(k_j\) are in the same terminal node increase their proximity by one, at the end normalize the proximities by dividing by the number of trees. To visualize the proximity matrix we use a scatter plot with information from multidimensional scaling method. In this plot color indicates the true species and sex. For this data two dimensions are enough to see the four groups separated quite well. Some crabs are clearly more similar to a different group, though, especially in examining the sex differences.
The vote matrix (\(n \times p\)) contains the proportion of times each observation was classified to each class, whole oob. Two possible approaches to visualize the vote matrix information are shown, with a side-by-side jittered dot plot or with ternary plots. A side-by-side jittered dotplot is used for the display, where class is displayed on one axis and proportion is displayed on the other. For each dotplot, the ideal arrangement is that points of observations in that class have values bigger than 0.5, and all other observations have less. This data is close to the ideal but not perfect, e.g. there are a few blue male crabs (orange) that are frequently predicted to be blue females (green), and a few blue female crabs predicted to be another class.
A ternary plot is a triangular diagram that shows the proportion of three variables that sum to a constant and is done using barycentric coordinates. Compositional data lies in a \((p-1)\)-D simplex in \(p\)-space. One advantage of ternary plot is that are good to visualize compositional data and the proportion of three variables in a two dimensional space can be shown. When we have tree classes a ternary plot are well defined. With more than tree classes the ternary plot idea need to be generalized.@sutherland2000orca suggest the best approach to visualize compositional data will be to project the data into the \((p-1)-\)D space (ternary diagram in \(2-D\)) This will be the approach used to visualize the vote matrix information.
A ternary plot is a triangular diagram used to display compositional data with three components. More generally, compositional data can have any number of components, say \(p\), and hence is contrained to a \((p-1)\)-D simplex in \(p\)-space. The vote matrix is an example of compositional data, with \(G\) components.
To see a complete description about how to visualize a PPforest object read Interactive Graphics for Visually Diagnosing Forest Classifiers in R (Silva, Cook, and Lee 2017).