Short Tutorial

In this tutorial, we show how to use the ocf package to estimate and make inference about the conditional choice probabilities and the covariates’ marginal effects.

Before diving in the coding, we provide an overview of the statistical problem at hand.

Ordered Choice Models

We postulate the existence of a latent and continuous outcome variable \(Y_i^*\), assumed to obey the following regression model:

\[ Y_i^* = g \left( X_i \right) + \epsilon_i \] where \(X_i\) consists of a set of raw covariates, \(g \left( \cdot \right)\) is a potentially non-linear regression function, and \(\epsilon_i\) is independent of \(X_i\).

An observational rule links the observed outcome \(Y_i\) to the latent outcome \(Y_{i}^*\) using unknown threshold parameters \(- \infty = \zeta_0 < \zeta_1 < \dots < \zeta_{M - 1} < \zeta_M = \infty\) that define intervals on the support of \(Y_i^*\), with each interval corresponding to one of the \(M\) categories or classes of \(Y_i\):

\[ \zeta_{m - 1} < Y_i^* \leq \zeta_{m} \implies Y_i = m, \quad m = 1, \dots, M \]

The statistical targets of interest are the conditional choice probabilities:

\[ p_m \left( X_i \right) := \mathbb{P} \left( Y_i = m | X_i \right) \]

and the marginal effect of the \(j\)-th covariate on \(p_m \left( \cdot \right)\):

\[ \nabla^j p_m \left( x \right) := \begin{cases} \frac{\partial p_m \left( x \right)}{\partial x_j}, & \text{if } x_j \text{ is continuous} \\ p_m \left( \lceil x_j \rceil \right) - p_m \left( \lfloor x_j \rfloor \right), & \text{if } x_j \text{ is discrete} \end{cases} \] where \(x_j\) is the \(j\)-th element of the vector \(x\) and \(\lceil x_j \rceil\) and \(\lfloor x_j \rfloor\) correspond to \(x\) with its \(j\)-th element rounded up and down to the closest integer.

Code

For illustration purposes, we generate a synthetic data set. Details about the employed DGP can be retrieved by running help(generate_ordered_data).

## Generate synthetic data.
set.seed(1986)

n <- 100
data <- generate_ordered_data(n)

sample <- data$sample
Y <- sample$Y
X <- sample[, -1]

table(Y)
#> Y
#>  1  2  3 
#> 31 37 32

head(X)
#>            x1 x2         x3 x4            x5 x6
#> 1 -0.04625141  1 -1.7879732  0 -1.0515868012  1
#> 2  0.28000082  0 -1.1553030  0 -0.4285613418  0
#> 3  0.25317063  1  1.6677330  0  0.1621459072  0
#> 4 -0.96411077  0 -0.1587051  0  0.3587438820  0
#> 5  0.49222664  0 -1.4020533  1  0.0004035277  0
#> 6 -0.69874551  1 -0.4450061  0 -0.3183447897  0

Conditional Probabilities

To estimate the conditional probabilities, the ocf function constructs a collection of forests, one for each category of Y (three in this case). We can then use the forests to predict out-of-sample using the predict method. predict returns a matrix with the predicted probabilities and a vector of predicted class labels (each observation is labelled to the highest-probability class).

## Training-test split.
train_idx <- sample(seq_len(length(Y)), floor(length(Y) * 0.5))

Y_tr <- Y[train_idx]
X_tr <- X[train_idx, ]

Y_test <- Y[-train_idx]
X_test <- X[-train_idx, ]

## Fit ocf on training sample. Use default settings.
forests <- ocf(Y_tr, X_tr)

## Summary of data and tuning parameters.
summary(forests)
#> Call: 
#> ocf(Y_tr, X_tr) 
#> 
#> Data info: 
#> Full sample size:   50 
#> N. covariates:      6 
#> Classes:            1 2 3 
#> 
#> Relative variable importance: 
#>    x1    x2    x3    x4    x5    x6 
#> 0.353 0.059 0.266 0.092 0.206 0.024 
#> 
#> Tuning parameters: 
#> N. trees:           2000 
#> mtry:               3 
#> min.node.size       5 
#> Subsampling scheme: No replacement 
#> Honesty:            FALSE 
#> Honest fraction:    0

## Out-of-sample predictions.
predictions <- predict(forests, X_test)

head(predictions$probabilities)
#>         P(Y=1)    P(Y=2)     P(Y=3)
#> [1,] 0.4224274 0.4548215 0.12275111
#> [2,] 0.4786262 0.4133636 0.10801015
#> [3,] 0.1446138 0.4064470 0.44893918
#> [4,] 0.6215123 0.3249310 0.05355674
#> [5,] 0.4359897 0.3503095 0.21370084
#> [6,] 0.6224514 0.3216924 0.05585619

table(Y_test, predictions$classification)
#>       
#> Y_test  1  2  3
#>      1 11  4  1
#>      2  7  4  8
#>      3  3  1 11

To produce consistent and asymptotically normal predictions, we need to set the honesty argument to TRUE. This makes the ocf function using different parts of the training sample to construct the forests and compute the predictions.

## Honest forests.
honest_forests <- ocf(Y_tr, X_tr, honesty = TRUE)

honest_predictions <- predict(honest_forests, X_test)
head(honest_predictions$probabilities)
#>         P(Y=1)    P(Y=2)    P(Y=3)
#> [1,] 0.3780979 0.3260178 0.2958843
#> [2,] 0.3907848 0.3341460 0.2750692
#> [3,] 0.2350265 0.4046100 0.3603635
#> [4,] 0.4247656 0.3567507 0.2184836
#> [5,] 0.3556529 0.3095442 0.3348028
#> [6,] 0.4039818 0.3383057 0.2577125

To estimate standard errors for the predicted probabilities, we need to set the inference argument to TRUE. However, this works only if honesty is TRUE, as the formula for the variance is valid only for honest predictions. Notice that the estimation of the standard errors can considerably slow down the routine. However, we can increase the number of threads used to construct the forests by using the n.threads argument.

## Compute standard errors. Do not run.
# honest_forests <- ocf(Y_tr, X_tr, honesty = TRUE, inference = TRUE, n.threads = 0) # Use all CPUs.
# head(honest_forests$predictions$standard.errors)

Covariates’ Marginal Effects

To estimate the covariates’ marginal effects, we can post-process the conditional probability predictions. This is performed by the marginal_effects function that can estimate mean marginal effects, marginal effects at the mean, and marginal effects at the median, according to the eval argument.

## Marginal effects at the mean.
me_atmean <- marginal_effects(forests, eval = "atmean") # Try also 'eval = "atmean"' and 'eval = "mean"'.
print(me_atmean) # Try also 'latex = TRUE'.
#> ocf marginal effects results 
#> 
#> Data info: 
#> Number of classes:    3 
#> Sample size:          50 
#> 
#> Tuning parameters: 
#> Evaluation:           atmean 
#> Bandwidth:            0.1 
#> Number of trees:      2000 
#> Honest forests:       FALSE 
#> Honesty fraction:     0 
#> 
#> Marginal Effects: 
#>    P'(Y=1) P'(Y=2) P'(Y=3)
#> x1  -0.100  -0.311   0.411
#> x2  -0.078   0.065   0.013
#> x3  -0.048  -0.013   0.060
#> x4  -0.024  -0.175   0.198
#> x5   0.033   0.049  -0.082
#> x6  -0.006   0.010  -0.004

As before, we can set the inference argument to TRUE to estimate the standard errors. Again, this requires the use of honest forests and can considerably slow down the routine.

## Compute standard errors.
honest_me_atmean <- marginal_effects(honest_forests, eval = "atmean", inference = TRUE)
print(honest_me_atmean) # Try also 'latex = TRUE'.
#> ocf marginal effects results 
#> 
#> Data info: 
#> Number of classes:    3 
#> Sample size:          50 
#> 
#> Tuning parameters: 
#> Evaluation:           atmean 
#> Bandwidth:            0.1 
#> Number of trees:      2000 
#> Honest forests:       TRUE 
#> Honesty fraction:     0.5 
#> 
#> Marginal Effects: 
#>    P'(Y=1) P'(Y=2) P'(Y=3)
#> x1  -0.017   0.000   0.017
#> x2  -0.036   0.013   0.023
#> x3  -0.011   0.009   0.002
#> x4  -0.014  -0.053   0.067
#> x5  -0.006   0.026  -0.020
#> x6  -0.002  -0.003   0.005