Optimising random forest using optRF

Introduction

This vignette serves as a quick-start guide for determining the optimal number of trees in a random forest using the optRF package. The optRF package models the relationship between the number of trees and the resulting stability of the random forest and uses this model to determine the optimal number of trees, beyond which adding further trees would increase the computation time while increasing the stability only marginally. The optRF package depends on ranger for an efficient random forest implementation. To demonstrate its functionality, we will use the SNPdata data set included in the optRF package. Its first column contains the yield of 250 wheat individuals, and the remaining 5,000 columns contain genomic markers (so-called SNP markers) that take either the value 0 or 2:

library(ranger)
library(optRF)
SNPdata[1:5, 1:5]
#>           Yield SNP_0001 SNP_0002 SNP_0003 SNP_0004
#> ID_001 670.7588        0        0        0        0
#> ID_002 542.5611        0        2        0        0
#> ID_003 591.6631        2        2        0        2
#> ID_004 476.3727        0        0        0        0
#> ID_005 635.9814        2        2        0        2

The opt_prediction function

Optimising random forest for predictions

To determine the optimal number of trees for random forest predictions, the opt_prediction function from the optRF package can be used. For a detailed description of the opt_prediction function, please refer to its vignette.

As an example, let’s use the first 200 rows of the SNPdata data set as the training data and the last 50 rows as the test data:

Training = SNPdata[1:200,] # Rows 1 to 200 as training data
Test = SNPdata[201:250,-1] # Rows 201 to 250 as test data, excluding the response column (column 1)

To make the code reproducible, we will set a seed of 123 before using the opt_prediction function. This function identifies the optimal number of trees required for random forest to predict the response variable in the test data set. The response variable from the training data set is passed to the y argument, the predictor variables from the training data set to the X argument, and the predictor variables from the test data set to the X_Test argument. Let’s also assume that we aim to select the 5 top-performing individuals from the test data set; this corresponds to the top 10%, which is passed to the alpha argument:

set.seed(123) # Set a seed for reproducibility
optRF_result = opt_prediction(y=Training[,1], X=Training[,-1], 
                              X_Test=Test, alpha=0.1)

After running the opt_prediction function, the recommended number of trees and the expected stability of the random forest model can be easily accessed using the summary function:

summary(optRF_result)
#> Recommended number of trees: 19000
#> Expected prediction stability: 0.9780626
#> Expected selection stability: 0.7137072
#> Expected computation time (sec): 1.662468

Here, the optimal number of trees to obtain stable predictions is 19,000. Using random forest with this number of trees, one can expect a prediction stability of 0.98, which describes the correlation between the predicted responses in repeated runs of random forest.
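
To illustrate what this metric measures, one could grow two forests of this size independently and correlate their predictions. Below is a minimal sketch, assuming the Training and Test objects defined above (the names RF_run_1 and RF_run_2 are chosen here purely for illustration):

# Illustration only: grow two independent forests with the recommended number of trees
RF_run_1 = ranger(y=Training[,1], x=Training[,-1], num.trees=19000, seed=1)
RF_run_2 = ranger(y=Training[,1], x=Training[,-1], num.trees=19000, seed=2)
# Correlation of the two prediction vectors; expected to be close to the reported prediction stability
cor(predict(RF_run_1, data=Test)$predictions,
    predict(RF_run_2, data=Test)$predictions)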

With the optimal number of trees identified, random forest can now be used to predict the response variable in the test data set. For this, the ranger function can be used:

RF_model = ranger(y=Training[,1], x=Training[,-1], 
                  write.forest = TRUE, num.trees=optRF_result$recommendation)
predictions = predict(RF_model, data=Test)$predictions
predicted_Test_data = data.frame(ID = row.names(Test), predicted_response = predictions)

The predictions for the test data set are now stored in the predicted_Test_data data frame, which includes the IDs and their corresponding predicted responses.

Optimising random forest for prediction-based decisions

Predicted values from random forest can support decision-making, such as selecting the top-performing individuals in a data set. To proceed, it is essential to define the number of top-performing individuals to be selected. For instance, suppose we aim to select the top 5 performers from a test data set containing 50 individuals.

To evaluate the stability of random forest in supporting this selection decision, repeated runs of the model are performed. In each run, the top individuals are classified as selected, while the remaining individuals are classified as rejected. The selection stability is then quantified using Fleiss’ Kappa, which measures the consistency of these classifications across the repeated runs.
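
As an illustration of how such a selection stability could be computed by hand, one might repeat the random forest a few times and pass the resulting selected/rejected classifications to the kappam.fleiss function from the irr package (irr is not a dependency of optRF and is used here only for this sketch; the number of runs and trees is arbitrary):

library(irr) # provides kappam.fleiss for Fleiss' Kappa
selections = sapply(1:5, function(run){
  RF = ranger(y=Training[,1], x=Training[,-1], num.trees=2000, seed=run)
  preds = predict(RF, data=Test)$predictions
  # Classify the top 5 individuals as selected and the remaining ones as rejected
  ifelse(rank(-preds) <= 5, "selected", "rejected")
})
kappam.fleiss(selections) # agreement of the selections across the 5 runs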

To determine the optimal number of trees for selection, the argument recommendation has to be set to "selection":

set.seed(123) # Set a seed for reproducibility
optRF_result_2 = opt_prediction(y=Training[,1], X=Training[,-1], X_Test=Test,
                                alpha=0.1, recommendation="selection")

The recommended number of trees and the expected stability can be accessed as above:

summary(optRF_result_2)
#> Recommended number of trees: 73000
#> Expected prediction stability: 0.9931893
#> Expected selection stability: 0.837301
#> Expected computation time (sec): 7.015361

One can see that the recommended number of trees has increased to 73,000. Random forest can now be used with this number of trees to ensure stable predictions with a prediction stability of 0.993 and prediction-based decisions with a selection stability of 0.837. To demonstrate the robustness of the predictions and the decision-making process, decisions from random forest should always be published together with the prediction and selection stability, which can be seen in the summary output of optRF_result_2 above or accessed directly via optRF_result_2$expected_RF_stability.
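
A minimal sketch of this selection step, reusing the objects defined above (the names RF_model_sel, predictions_sel, and selected_individuals are introduced here only for illustration), could look as follows:

# Refit random forest with the number of trees recommended for selection
RF_model_sel = ranger(y=Training[,1], x=Training[,-1], write.forest=TRUE,
                      num.trees=optRF_result_2$recommendation)
predictions_sel = predict(RF_model_sel, data=Test)$predictions
# Select the 5 individuals with the highest predicted response
selected_individuals = row.names(Test)[order(predictions_sel, decreasing=TRUE)[1:5]]
selected_individuals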

The opt_importance function

Optimising random forest for variable importance estimation

In addition to predictions, random forest can be used to estimate variable importance and select the most relevant variables in a data set. For variable selection, the opt_importance function from the optRF package can help determine the optimal number of trees for random forest. For a detailed explanation of the opt_importance function, please refer to its vignette.

To make the code reproducible, we will set a seed of 123 before using the opt_importance function. This function identifies the optimal number of trees required to reliably estimate the importance of variables in a data set:

set.seed(123) # Set a seed for reproducibility
optRF_result = opt_importance(y=SNPdata[,1], X=SNPdata[,-1])

Once the opt_importance function is executed, the recommended number of trees and the stability metrics can be easily accessed:

summary(optRF_result)
#> Recommended number of trees: 40000
#> Expected variable importance stability: 0.958165
#> Expected selection stability: 0.5856996
#> Expected computation time (sec): 6.641934

The results indicate that the optimal number of trees for estimating variable importance is 40,000. With this configuration, the variable importance stability is 0.96, which reflects the correlation of variable importance estimates across repeated runs of random forest.

With the optimal number of trees determined, random forest can now be used to estimate variable importance. This can be achieved using the ranger function with permutation importance:

RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result$recommendation,
                  write.forest = TRUE, importance="permutation")
D_VI = data.frame(variable = names(SNPdata)[-1], 
                  importance = RF_model$variable.importance)

The variable importance results are now saved in the data frame D_VI, which contains the variable names and the corresponding importance estimates.
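
To get a first impression of these estimates, the variables with the largest importance values can, for example, be displayed as follows:

head(D_VI[order(D_VI$importance, decreasing=TRUE),]) # variables with the highest importance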

Optimising random forest for variable selection

Now that we have obtained stable estimates of variable importance, we can use these estimates to select the most important variables from the data set. However, before proceeding, it is necessary to estimate how many variables in the data set significantly affect the response variable.

One approach to achieve this is by visualising the distribution of variable importance estimates. A histogram provides an intuitive way to identify patterns such as a clear separation between important and less important variables:

hist(D_VI$importance, xlim=c(-10, 50), 
     main="Histogram of variable importances", xlab="")

From the histogram, it is apparent that the majority of variables have importance values between -5 and 5. However, a small subset of variables exceeds an importance threshold of 5. These variables are likely to have a genuine impact on the response and should be selected for further analysis.

Based on the histogram, we calculate the number of variables with importance values greater than 5:

selection_size = sum(RF_model$variable.importance>5)

Using this number of selected variables, the optimal number of trees for random forest can be determined with the opt_importance function:

set.seed(123) # Set a seed for reproducibility
optRF_result_2 = opt_importance(y=SNPdata[,1], X=SNPdata[,-1], 
                                recommendation = "selection",
                                alpha = selection_size)

With the recommended number of trees, we re-estimate variable importance and select the top variables:

RF_model_2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], num.trees=optRF_result_2$recommendation,
                    write.forest = TRUE, importance="permutation")
D_VI_2 = data.frame(variable = names(SNPdata)[-1], 
                    importance = RF_model_2$variable.importance)
D_VI_2 = D_VI_2[order(D_VI_2$importance, decreasing=TRUE),]
selected_variables = D_VI_2[1:selection_size,1]

The vector selected_variables now contains the names of the most important variables in the data set. To enhance reproducibility and reliability, the selection results should be published along with the variable importance stability and selection stability metrics. These can again be easily accessed:

summary(optRF_result_2)
#> Recommended number of trees: 49000
#> Expected variable importance stability: 0.9655379
#> Expected selection stability: 0.9207115
#> Expected computation time (sec): 8.950901

Conclusions

Random forest can be effectively used to predict the response variable in a test data set where only the predictor variables are known, and the predicted response values can be used to identify the top-performing individuals. Similarly, random forest can estimate the importance of variables in a data set and facilitate the selection of the most important variables. However, it must be noted that random forest is a non-deterministic method, meaning it can yield different results even when run on the same data set. To ensure reproducible and reliable results, it is therefore essential to determine the optimal number of trees, to use this number of trees when making decisions based on random forest, and to evaluate and report the stability achieved with the chosen number of trees.