In this tutorial, we show how to use the ocf
package to
estimate and make inference about the conditional choice probabilities
and the covariates’ marginal effects.
Before diving in the coding, we provide an overview of the statistical problem at hand.
We postulate the existence of a latent and continuous outcome variable \(Y_i^*\), assumed to obey the following regression model:
\[ Y_i^* = g \left( X_i \right) + \epsilon_i \] where \(X_i\) consists of a set of raw covariates, \(g \left( \cdot \right)\) is a potentially non-linear regression function, and \(\epsilon_i\) is independent of \(X_i\).
An observational rule links the observed outcome \(Y_i\) to the latent outcome \(Y_{i}^*\) using unknown threshold parameters \(- \infty = \zeta_0 < \zeta_1 < \dots < \zeta_{M - 1} < \zeta_M = \infty\) that define intervals on the support of \(Y_i^*\), with each interval corresponding to one of the \(M\) categories or classes of \(Y_i\):
\[ \zeta_{m - 1} < Y_i^* \leq \zeta_{m} \implies Y_i = m, \quad m = 1, \dots, M \]
The statistical targets of interest are the conditional choice probabilities:
\[ p_m \left( X_i \right) := \mathbb{P} \left( Y_i = m | X_i \right) \]
and the marginal effect of the \(j\)-th covariate on \(p_m \left( \cdot \right)\):
\[ \nabla^j p_m \left( x \right) := \begin{cases} \frac{\partial p_m \left( x \right)}{\partial x_j}, & \text{if } x_j \text{ is continuous} \\ p_m \left( \lceil x_j \rceil \right) - p_m \left( \lfloor x_j \rfloor \right), & \text{if } x_j \text{ is discrete} \end{cases} \] where \(x_j\) is the \(j\)-th element of the vector \(x\) and \(\lceil x_j \rceil\) and \(\lfloor x_j \rfloor\) correspond to \(x\) with its \(j\)-th element rounded up and down to the closest integer.
For illustration purposes, we generate a synthetic data set. Details
about the employed DGP can be retrieved by running
help(generate_ordered_data)
.
## Generate synthetic data.
set.seed(1986)
n <- 100
data <- generate_ordered_data(n)
sample <- data$sample
Y <- sample$Y
X <- sample[, -1]
table(Y)
#> Y
#> 1 2 3
#> 31 37 32
head(X)
#> x1 x2 x3 x4 x5 x6
#> 1 -0.04625141 1 -1.7879732 0 -1.0515868012 1
#> 2 0.28000082 0 -1.1553030 0 -0.4285613418 0
#> 3 0.25317063 1 1.6677330 0 0.1621459072 0
#> 4 -0.96411077 0 -0.1587051 0 0.3587438820 0
#> 5 0.49222664 0 -1.4020533 1 0.0004035277 0
#> 6 -0.69874551 1 -0.4450061 0 -0.3183447897 0
To estimate the conditional probabilities, the ocf
function constructs a collection of forests, one for each category of
Y
(three in this case). We can then use the forests to
predict out-of-sample using the predict
method.
predict
returns a matrix with the predicted probabilities
and a vector of predicted class labels (each observation is labelled to
the highest-probability class).
## Training-test split.
train_idx <- sample(seq_len(length(Y)), floor(length(Y) * 0.5))
Y_tr <- Y[train_idx]
X_tr <- X[train_idx, ]
Y_test <- Y[-train_idx]
X_test <- X[-train_idx, ]
## Fit ocf on training sample. Use default settings.
forests <- ocf(Y_tr, X_tr)
## Summary of data and tuning parameters.
summary(forests)
#> Call:
#> ocf(Y_tr, X_tr)
#>
#> Data info:
#> Full sample size: 50
#> N. covariates: 6
#> Classes: 1 2 3
#>
#> Relative variable importance:
#> x1 x2 x3 x4 x5 x6
#> 0.353 0.059 0.266 0.092 0.206 0.024
#>
#> Tuning parameters:
#> N. trees: 2000
#> mtry: 3
#> min.node.size 5
#> Subsampling scheme: No replacement
#> Honesty: FALSE
#> Honest fraction: 0
## Out-of-sample predictions.
predictions <- predict(forests, X_test)
head(predictions$probabilities)
#> P(Y=1) P(Y=2) P(Y=3)
#> [1,] 0.4224274 0.4548215 0.12275111
#> [2,] 0.4786262 0.4133636 0.10801015
#> [3,] 0.1446138 0.4064470 0.44893918
#> [4,] 0.6215123 0.3249310 0.05355674
#> [5,] 0.4359897 0.3503095 0.21370084
#> [6,] 0.6224514 0.3216924 0.05585619
table(Y_test, predictions$classification)
#>
#> Y_test 1 2 3
#> 1 11 4 1
#> 2 7 4 8
#> 3 3 1 11
To produce consistent and asymptotically normal predictions, we need
to set the honesty
argument to TRUE. This makes the
ocf
function using different parts of the training sample
to construct the forests and compute the predictions.
## Honest forests.
honest_forests <- ocf(Y_tr, X_tr, honesty = TRUE)
honest_predictions <- predict(honest_forests, X_test)
head(honest_predictions$probabilities)
#> P(Y=1) P(Y=2) P(Y=3)
#> [1,] 0.3780979 0.3260178 0.2958843
#> [2,] 0.3907848 0.3341460 0.2750692
#> [3,] 0.2350265 0.4046100 0.3603635
#> [4,] 0.4247656 0.3567507 0.2184836
#> [5,] 0.3556529 0.3095442 0.3348028
#> [6,] 0.4039818 0.3383057 0.2577125
To estimate standard errors for the predicted probabilities, we need
to set the inference
argument to TRUE. However, this works
only if honesty
is TRUE, as the formula for the variance is
valid only for honest predictions. Notice that the estimation of the
standard errors can considerably slow down the routine. However, we can
increase the number of threads used to construct the forests by using
the n.threads
argument.
To estimate the covariates’ marginal effects, we can post-process the
conditional probability predictions. This is performed by the
marginal_effects
function that can estimate mean marginal
effects, marginal effects at the mean, and marginal effects at the
median, according to the eval
argument.
## Marginal effects at the mean.
me_atmean <- marginal_effects(forests, eval = "atmean") # Try also 'eval = "atmean"' and 'eval = "mean"'.
print(me_atmean) # Try also 'latex = TRUE'.
#> ocf marginal effects results
#>
#> Data info:
#> Number of classes: 3
#> Sample size: 50
#>
#> Tuning parameters:
#> Evaluation: atmean
#> Bandwidth: 0.1
#> Number of trees: 2000
#> Honest forests: FALSE
#> Honesty fraction: 0
#>
#> Marginal Effects:
#> P'(Y=1) P'(Y=2) P'(Y=3)
#> x1 -0.100 -0.311 0.411
#> x2 -0.078 0.065 0.013
#> x3 -0.048 -0.013 0.060
#> x4 -0.024 -0.175 0.198
#> x5 0.033 0.049 -0.082
#> x6 -0.006 0.010 -0.004
As before, we can set the inference
argument to TRUE to
estimate the standard errors. Again, this requires the use of honest
forests and can considerably slow down the routine.
## Compute standard errors.
honest_me_atmean <- marginal_effects(honest_forests, eval = "atmean", inference = TRUE)
print(honest_me_atmean) # Try also 'latex = TRUE'.
#> ocf marginal effects results
#>
#> Data info:
#> Number of classes: 3
#> Sample size: 50
#>
#> Tuning parameters:
#> Evaluation: atmean
#> Bandwidth: 0.1
#> Number of trees: 2000
#> Honest forests: TRUE
#> Honesty fraction: 0.5
#>
#> Marginal Effects:
#> P'(Y=1) P'(Y=2) P'(Y=3)
#> x1 -0.017 0.000 0.017
#> x2 -0.036 0.013 0.023
#> x3 -0.011 0.009 0.002
#> x4 -0.014 -0.053 0.067
#> x5 -0.006 0.026 -0.020
#> x6 -0.002 -0.003 0.005