This vignette serves as a quick-start guide for determining the optimal number of trees in a random forest using the optRF package. The optRF package models the relationship between the number of trees and the resulting stability of the random forest, and uses this model to determine the optimal number of trees, i.e. the point beyond which adding further trees would increase the computation time while only marginally increasing stability. The optRF package depends on ranger for an efficient random forest implementation. To demonstrate its functionality, we will use the SNPdata data set included in the optRF package. SNPdata contains the yield of 250 wheat individuals in its first column, followed by 5,000 genomic markers (so-called SNP markers) that take either the value 0 or 2:
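# Illustrative setup: load the packages used throughout this vignette and take a
# quick look at the data (the chosen rows and columns are an arbitrary example)
library(optRF)      # provides opt_prediction, opt_importance, and the SNPdata data set
library(ranger)     # random forest implementation used by optRF
SNPdata[1:5, 1:5]   # yield in the first column, SNP markers in the remaining columns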
The opt_prediction function

To determine the optimal number of trees for random forest predictions, the opt_prediction function from the optRF package can be used. For a detailed description of the opt_prediction function, please refer to its vignette.
As an example, let's use the first 200 rows of the SNPdata data set as the training data and the last 50 rows as the test data:
Training = SNPdata[1:200,] # Rows 1 to 200 as training data
Test = SNPdata[201:250,-1] # Rows 201 to 250 as test data, excluding the response column (column 1)
To make the code reproducible, we will set a seed of 123 before using the opt_prediction function. This function identifies the optimal number of trees required for random forest to predict the response variable in the test data set. The response variable from the training data set is passed to the y argument, the predictor variables from the training data set are inserted in the X argument, and the predictor variables from the test data set are specified in the X_Test argument. Let's also assume that we aim to select the 5 top-performing individuals from the test data set; this corresponds to the top 10%, which is passed to the alpha argument:
set.seed(123) # Set a seed for reproducibility
optRF_result = opt_prediction(y=Training[,1], X=Training[,-1],
X_Test=Test, alpha=0.1)
After running the opt_prediction function, the recommended number of trees and the expected stability of the random forest model can be easily accessed using the summary function:
summary(optRF_result)
#> Recommended number of trees: 19000
#> Expected prediction stability: 0.9780626
#> Expected selection stability: 0.7137072
#> Expected computation time (sec): 1.662468
Here, the optimal number of trees to obtain stable predictions is 19,000. Using random forest with this number of trees, one can expect a prediction stability of 0.98, which describes the correlation of the predicted response between repeated runs of random forest.
With the optimal number of trees identified, random forest can now be used to predict the response variable in the test data set. For this, the ranger function can be used:
# Train a random forest with the recommended number of trees
RF_model = ranger(y=Training[,1], x=Training[,-1],
                  write.forest=TRUE, num.trees=optRF_result$recommendation)
# Predict the response for the test data and collect the results in a data frame
predictions = predict(RF_model, data=Test)$predictions
predicted_Test_data = data.frame(ID = row.names(Test), predicted_response = predictions)
The predictions for the test data set are now stored in the predicted_Test_data data frame, which includes the IDs and their corresponding predicted responses.
Predicted values from random forest can support decision-making, such as selecting the top-performing individuals in a data set. To proceed, it is essential to define the number of top-performing individuals to be selected. For instance, suppose we aim to select the top 5 performers from a test data set containing 50 individuals.
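With the predictions in predicted_Test_data, this selection could, for example, be made by ordering the individuals by their predicted response and keeping the five best ones (the objects ordered_Test_data and selected_individuals below are introduced only for illustration):

# Rank the individuals by their predicted response and keep the five best ones
ordered_Test_data = predicted_Test_data[order(predicted_Test_data$predicted_response,
                                              decreasing=TRUE), ]
selected_individuals = ordered_Test_data$ID[1:5]
selected_individuals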
To evaluate the stability of random forest in supporting this selection decision, repeated runs of the model are performed. In each run, the top individuals are classified as "selected", while the remaining individuals are classified as "rejected". The selection stability is then quantified using Fleiss' Kappa, which measures the consistency of these classifications across the repeated runs.
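As a conceptual illustration of this procedure (not the internal optRF implementation; the number of trees and the use of only two runs below are arbitrary choices), one can compare which individuals two independent random forest runs would select:

# Illustration only: compare which individuals two independent random forest runs select
selections = replicate(2, {
  model = ranger(y=Training[,1], x=Training[,-1], write.forest=TRUE, num.trees=500)
  preds = predict(model, data=Test)$predictions
  rank(-preds) <= 5                                    # TRUE = selected, FALSE = rejected
})
table(run_1 = selections[,1], run_2 = selections[,2])  # agreement between the two runs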
To determine the optimal number of trees for selection, the recommendation argument has to be set to "selection":
set.seed(123) # Set a seed for reproducibility
optRF_result_2 = opt_prediction(y=Training[,1], X=Training[,-1], X_Test=Test,
alpha=0.1, recommendation="selection")
The recommended number of trees and the expected stability can be accessed as above:
summary(optRF_result_2)
#> Recommended number of trees: 73000
#> Expected prediction stability: 0.9931893
#> Expected selection stability: 0.837301
#> Expected computation time (sec): 7.015361
One can see that the recommended number of trees has increased to 73,000. Running random forest with this number of trees ensures stable predictions with a prediction stability of 0.993 and prediction-based decisions with a selection stability of 0.837. To demonstrate the robustness of the predictions and the decision-making process, the decisions from random forest should always be published together with the prediction and selection stability, which can be accessed via optRF_result_2$expected_RF_stability.
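For example:

optRF_result_2$expected_RF_stability   # expected prediction and selection stability to report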
The opt_importance function

In addition to predictions, random forest can be used to estimate variable importance and select the most relevant variables in a data set. For variable selection, the opt_importance function from the optRF package can help determine the optimal number of trees for random forest. For a detailed explanation of the opt_importance function, please refer to its vignette.
To make the code reproducible, we will set a seed of 123 before using the opt_importance function. This function identifies the optimal number of trees required to reliably estimate the importance of variables in a data set:
set.seed(123) # Set a seed for reproducibility
optRF_result = opt_importance(y=SNPdata[,1], X=SNPdata[,-1])
Once the opt_importance function is executed, the recommended number of trees and the stability metrics can be easily accessed:
summary(optRF_result)
#> Recommended number of trees: 40000
#> Expected variable importance stability: 0.958165
#> Expected selection stability: 0.5856996
#> Expected computation time (sec): 6.641934
The results indicate that the optimal number of trees for estimating variable importance is 40,000. With this configuration, the variable importance stability is 0.96, which reflects the correlation of variable importance estimates across repeated runs of random forest.
With the optimal number of trees determined, random forest can now be used to estimate variable importance. This can be achieved using the ranger function with permutation importance:
RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result$recommendation,
write.forest = TRUE, importance="permutation")
D_VI = data.frame(variable = names(SNPdata)[-1],
importance = RF_model$variable.importance)
The variable importance results are now saved in the data frame D_VI, which contains the variable names and the corresponding importance estimates.
Now that we have obtained stable estimates of variable importance, we can use these estimates to select the most important variables from the data set. However, before proceeding, it is necessary to estimate how many variables in the data set significantly affect the response variable.
One approach to achieve this is by visualising the distribution of variable importance estimates. A histogram provides an intuitive way to identify patterns such as a clear separation between important and less important variables:
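# Histogram of the variable importance estimates stored in D_VI
# (the number of breaks is an illustrative choice)
hist(D_VI$importance, breaks=100, xlab="Variable importance",
     main="Distribution of variable importance estimates")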
From the histogram, it is apparent that the majority of variables have importance values between -5 and 5. However, a small subset of variables exceeds an importance threshold of 5. These variables are likely to have a genuine impact on the response and should be selected for further analysis.
Based on the histogram, we calculate the number of variables with importance values greater than 5:
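The resulting count is stored in selection_size, which is used in the subsequent function calls:

selection_size = sum(D_VI$importance > 5)   # number of variables with importance above 5
selection_size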
Using this number of selected variables, the optimal number of trees for random forest can be determined with the opt_importance function:
set.seed(123) # Set a seed for reproducibility
optRF_result_2 = opt_importance(y=SNPdata[,1], X=SNPdata[,-1],
recommendation = "selection",
alpha = selection_size)
With the recommended number of trees, we re-estimate variable importance and select the top variables:
RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result_2$recommendation,
write.forest = TRUE, importance="permutation")
D_VI_2 = data.frame(variable = names(SNPdata)[-1],
importance = RF_model_2$variable.importance)
D_VI_2 = D_VI_2[order(D_VI_2$importance, decreasing=TRUE),]
selected_variables = D_VI_2[1:selection_size,1]
The vector selected_variables now contains the names of the most important variables in the data set. To enhance reproducibility and reliability, the selection results should be published along with the variable importance stability and selection stability metrics. These can again be easily accessed:
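summary(optRF_result_2)                 # recommended number of trees and expected stabilities
optRF_result_2$expected_RF_stability    # stability values to report alongside the selection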
Random forest can be effectively used to predict the response variable in a test data set where only the predictor variables are known, and the predicted response values can be used to identify the top-performing individuals. Similarly, random forest can estimate the importance of variables in a data set and facilitate the selection of the most important variables. However, it must be noted that random forest is a non-deterministic method, meaning it can yield different results even when run on the same data set. To ensure reproducible and reliable results, it is therefore essential to determine the optimal number of trees, evaluate the stability achieved with the chosen tree count, and use these when basing decisions on random forest.