This package uses a two-step procedure to estimate the conditional average treatment effects (CATE) with potentially high-dimensional covariate(s). In the first stage, the nuisance functions necessary for identifying CATE can be estimated by machine learning methods, allowing the number of covariates to be comparable to or larger than the sample size. The second stage consists of a low-dimensional local linear regression, reducing CATE to a function of the covariate(s) of interest.
The CATE estimator implemented in this package not only allows for high-dimensional data, but also has the “double robustness” property: either the model for the propensity score or the models for the conditional means of the potential outcomes are allowed to be misspecified (but not both).
The algorithm used in this package is described in the paper by Fan et al., “Estimation of Conditional Average Treatment Effects With High-Dimensional Data” (2022), Journal of Business & Economic Statistics. (doi:10.1080/07350015.2020.1811102)
Install the package from github:
install.packages('devtools')
::install_github('microfan1/hdcate') devtools
Load package:
library(hdcate)
Define the empirical data (which is a usual data.frame
),
here we use a built-in simulated data:
set.seed(1) # set an arbitrary random seed to ensure the reproducibility
<- HDCATE.get_sim_data(n_obs=500, n_var=100, n_rel_var=4, intercept=10) # generate data
data #> Actual CATE function is: CATE(x)=10+1*x
The simulated data contains 500 observations, each observation has
100 covariates, only the first 4 of those covariates (X1
,
X2
, X3
and X4
) are actually
present in the expectation function of potential outcome, and only the
last 4 covariates (X97
, X98
, X99
and X100
) are present in the propensity score function.
Let’s see what the data
looks like:
data#> Y D X1 X2 X3 X4
#> 1 0.000000 0 -0.626453811 0.077303123 1.134965089 0.850043472
#> 2 0.000000 0 0.183643324 -0.296868642 1.111931845 -0.925312995
#> 3 0.000000 0 -0.835628612 -1.183242240 -0.870777634 0.893581214
#> 4 0.000000 0 1.595280802 0.011292688 0.210731585 -0.941009738
#> 5 0.000000 0 0.329507772 0.991601036 0.069395647 0.538952094
....
In the data, we know that the dependent variable is Y
,
treatment variable is D
, and we use a model with the
formula of independent variable as:
<- "X1+X2+X3+X4+X5+X6"
x_formula
x_formula#> [1] "X1+X2+X3+X4+X5+X6"
Note that the propensity score function is misspecified in this example. We shall see the “double robust” shortly.
Then, we can use the data
to create a model:
<- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula) model
Pass this model
to an operating function
HDCATE.set_condition_var()
to set the required conditional
variable, here we want to know the conditional average treatment effects
(CATE) for X2
, from X2=-1
to
X2=1
, step by 0.01
:
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
Then we are ready to fit the model, pass the model
to
the operating function HDCATE.fit()
to fit the model:
HDCATE.fit(model)
After the model fitting, we obtain the resulting average treatment
effects (via accesing the ate
attribute of
model
) and conditional average treatment effects (via
accesing the hdcate
attribute of model
):
# ATE
$ate
model#> [1] 9.874983
# CATE
$hdcate
model#> [1] 8.989368 8.994602 9.001866 9.011095 9.022186 9.035010 9.049411
#> [8] 9.065215 9.082236 9.100283 9.119161 9.138682 9.158664 9.178935
#> [15] 9.199339 9.219731 9.239988 9.260001 9.279679 9.298949 9.317753
#> [22] 9.336050 9.353811 9.371020 9.387670 9.403765 9.419310 9.434320
#> [29] 9.448809 9.462793 9.476288 9.489310 9.501875 9.513993 9.525677
#> [36] 9.536936 9.547775 9.558199 9.568210 9.577807 9.586984 9.595733
....
After fitting, we may want to know the uniform confidence bands, to
achieve this, pass the fitted model
to the operating
function HDCATE.inference()
:
HDCATE.inference(model)
Then we obtain the uniform confidence bands for 2-sided test,
left-sided test and right-sided test by accessing the
i_two_side
, i_left_side
and
i_right_side
attributes of the model
,
respectively:
# two-sided bands
$i_two_side
model#> Sigificant level 0.01 Sigificant level 0.01
#> CATE on X2=-1 8.070107 9.908629
#> CATE on X2=-0.99 8.079787 9.909418
#> CATE on X2=-0.98 8.091711 9.912022
#> CATE on X2=-0.97 8.105821 9.916368
#> CATE on X2=-0.96 8.122017 9.922356
....
# left-sided bands
$i_left_side
model#> Sigificant level 0.01
#> CATE on X2=-1 8.070107
#> CATE on X2=-0.99 8.079787
#> CATE on X2=-0.98 8.091711
#> CATE on X2=-0.97 8.105821
#> CATE on X2=-0.96 8.122017
....
# right-sided bands
$i_right_side
model#> Sigificant level 0.01
#> CATE on X2=-1 9.908629
#> CATE on X2=-0.99 9.909418
#> CATE on X2=-0.98 9.912022
#> CATE on X2=-0.97 9.916368
#> CATE on X2=-0.96 9.922356
....
Finally, we may want a plot for the estimated results above:
HDCATE.plot(model)
# Alternatively, if the model is not inferenced, use the code below:
# HDCATE.plot(model, include_band=FALSE)
# Alternatively, we may want to save the full plot as a high-definition PDF file:
# HDCATE.plot(model, output_pdf=TRUE, pdf_name='hdcate_plot.pdf',
# include_band=TRUE, test_side='both', y_axis_min='auto', y_axis_max='auto',
# display.hdcate='HDCATEF', display.ate='ATE', display.siglevel='sig_level'
# )
From the graph above, although the model we use is misspecified for
the propensity score function (we used the formula
Y~X1+X2+X3+X4+X5+X6
), we still obtain a robust result,
which is very close to the actual CATE function
(CATE(X2)=10+X2
). If we use a model that is misspecified
for the expectation function of potential outcome, the results are also
robust:
= paste(paste0('X', c(90:100)), collapse ='+') # x_formula="X90+...+X100"
x_formula
<- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
model
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
HDCATE.fit(model)
HDCATE.inference(model)
HDCATE.plot(model)
The results above show the “double robust” property of the estimating approach in this package, see Fan et al. (2022) for more theoretical results.
By default, when creating an HDCATE model, this package uses
full-sample approach, but you are also very welcome to use the
cross-fitting approach described in Fan et
al. (2022), by simply pass model
into the operator
HDCATE.use_cross_fitting()
right after the HDCATE
model
has been created, that is:
# create the model first, or use an existing `model` and skip this line
<- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
model
# suppose we want to use the 5-fold cross-fitting approach:
HDCATE.use_cross_fitting(model, k_fold=5)
You may also want to set the folds manually, in this case, pass a list of index vector to the third argument of the operator:
# In this case, the param k_fold is auto detected, you can pass any value to it.
HDCATE.use_cross_fitting(model, k_fold=1, folds=list(c(1:250), c(251:500)))
If we wants to use full-sample approach again, then we can use
anothor operator HDCATE.use_full_sample()
.
HDCATE.use_full_sample(model)
By default, the package uses LASSO on the first stage (i.e. the estimation of the treated/untreated expectation functions and propensity score functions), but it is highly (and easily) extendable.
Researchers are welcome to define their own preferred methods (such as other ML methods) to run the first-stage estimation.
To define your own approach, you should define several functions that
describe how you fit and predict the input data, and then pass those
functions (and the model
object) to the operator
HDCATE.set_first_stage()
. Then the package will use your
method on the first stage.
The template of those functions are given as follows:
<- function(df) {
my_method_to_fit_expectation # fit the input data.frame (df), where the first column is outcome value (Y)
# and the rest are covariates in x_formula
# ...
# return a fitted object
}<- function(fitted_model, df) {
my_method_to_predict_expectation # use the fitted object (returned by your function above) to predict the conditional expectation
# fucntion on the input data.frame (df)
# ...
# return a vector of predicted values
}<- function(df) {
my_method_to_fit_propensity # fit the input data.frame (df), where the first column is dummy treatment
# value (D) and the rest are covariates in x_formula
# ...
# return a fitted object
}<- function(fitted_model, df) {
my_method_to_predict_propensity # use the fitted object (returned by your function above) to predict the propensity score
# fucntion on the input data.frame (df)
# ...
# return a vector of predicted values
}
Example 1 (LASSO): the default ML method in
HDCATE.fit
is LASSO, users can manually define a ML method
using the above template, and we again use LASSO as an example, similar
practice applies for other ML methods:
Note: In practice, LASSO is the built-in method in the package, if you just want to use LASSO, there’s no need to do the following.
<- "X1+X2+X3+X4+X5+X6" # propensity function is misspecified
x_formula # create a model
<- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
model
# using the template above, we can write:
<- function(df) {
my_method_to_fit_expectation # fit the input data.frame (df), where the first column is outcome value (Y)
# and the rest are covariates in x_formula
::rlasso(as.formula(paste0('Y', "~", x_formula)), df)
hdm# return a fitted object
}<- function(fitted_model, df) {
my_method_to_predict_expectation # use the fitted object (fitted_model) to predict the conditional expectation
# fucntion on the input data.frame (df)
predict(fitted_model, df)
# return a vector of predicted values
}<- function(df) {
my_method_to_fit_propensity # fit the input data.frame (df), where the first column is dummy treatment
# variable (D) and the rest are covariates in x_formula
::rlassologit(as.formula(paste0('D', "~", x_formula)), df)
hdm# return a fitted object
}<- function(fitted_model, df) {
my_method_to_predict_propensity # use the fitted object (fitted_model) to predict the propensity score
# fucntion on the input data.frame (df)
predict(fitted_model, df, type="response")
# return a vector of predicted values
}
After defining these functions, we can easily plug them into the
package using operator HDCATE.set_first_stage()
:
HDCATE.set_first_stage(
model,fit.treated=my_method_to_fit_expectation,
fit.untreated=my_method_to_fit_expectation,
fit.propensity=my_method_to_fit_propensity,
predict.treated=my_method_to_predict_expectation,
predict.untreated=my_method_to_predict_expectation,
predict.propensity=my_method_to_predict_propensity
)
And then we can use the user-defined first-stage ML methods to estimate CATE:
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
HDCATE.fit(model)
HDCATE.inference(model)
HDCATE.plot(model)
We shall see the similar result, since the user-defined method in this example is the same as the built-in method (LASSO).
To clear the user-defined first-stage method and return to the default built-in method, simply call:
HDCATE.unset_first_stage(model)
Example 2 (Random Forest): Here we provide an additional example for another popular ML method: Random Forest, to further show how users can easily plug their own ML methods into this package:
# propensity function is misspecified, for example
<- "X1+X2+X3+X4+X5+X6"
x_formula # create a model
<- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
model
# =========== User-defined ML approach start ===========
# using the template above, we can write:
<- function(df) {
my_method_to_fit_expectation # fit the input data.frame (df), where the first column is outcome value (Y)
# and the rest are covariates in x_formula
::randomForest(as.formula(paste0('Y', "~", x_formula)), data = df)
randomForest# return a fitted object
}<- function(fitted_model, df) {
my_method_to_predict_expectation # use the fitted object (fitted_model) to predict the conditional expectation
# fucntion on the input data.frame (df)
predict(fitted_model, df)
# return a vector of predicted values
}<- function(df) {
my_method_to_fit_propensity # fit the input data.frame (df), where the first column is dummy treatment
# variable (D) and the rest are covariates in x_formula
::randomForest(as.formula(paste0('D', "~", x_formula)), data = df)
randomForest# return a fitted object
}<- function(fitted_model, df) {
my_method_to_predict_propensity # use the fitted object (fitted_model) to predict the propensity score
# fucntion on the input data.frame (df)
predict(fitted_model, df)
# return a vector of predicted values
}
HDCATE.set_first_stage(
model,fit.treated=my_method_to_fit_expectation,
fit.untreated=my_method_to_fit_expectation,
fit.propensity=my_method_to_fit_propensity,
predict.treated=my_method_to_predict_expectation,
predict.untreated=my_method_to_predict_expectation,
predict.propensity=my_method_to_predict_propensity
)
# =========== User-defined ML approach end ===========
# set conditional variable in CATE, same as above
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
# fit, inference and plot
HDCATE.fit(model)
HDCATE.inference(model)
HDCATE.plot(model)
As shown above, after applying user-defined Random Forest approach into this package, the CATE estimation result seems even better than that using LASSO in respect of confidence bands.
Futhermore, users may want to extract the fitted ML model for
analysis, this can be achieved by accessing the
propensity_score_model_list
,
untreated_cond_exp_model_list
and
treated_cond_exp_model_list
attributes of the
model
object for the fitted propensity score model, fitted
conditional expectation model (for untreated) and fitted conditional
expectation model (for treated), respectively. The value of these
attributes are “List of K
”, where K
is the
number of folds, K=1
when using the default full-sample
estimator.
For example, we may want to calculate the variable importance in fitting treated conditional expectations using the fitted random forest model:
# the list have only one element since we use full-sample estimator
<- model$treated_cond_exp_model_list[[1]]
rf_model
# then, we can use the fitted object `rf_model` for futher analysis.
# the rest of codes are omitted since they are unconcerned for this package.
# ...
We can see that in this example, the powerful random forest model
captures the actual relevant variable of conditional expectation
functions of potential outcome (i.e. X1
, X2
,
X3
and X4
) correctly.
Similar practice applies for other ML methods.
Users can define their own inference procedure by passing the
following arguments into HDCATE.inference()
:
sig_level
: set significant level of the confidence
bands, the default is 0.01, passing a vector of significant levels is
also allowed.n_rep_boot
: set the number of bootstrap replications,
the default is 1000.For example, we may want to construct uniform confidence bands with a larger number of bootstrap replications=3000, and use another significant level=5%:
HDCATE.inference(model, sig_level=0.05, n_rep_boot=3000)
On the second stage, we use rule-of-thumb to decide the bandwidth in
the local linear regression, the bandwidth can also be manually set
before calling HDCATE.fit()
:
HDCATE.set_bw(model, 0.15) # for example, set bandwidth=0.15
To reset to default rule-of-thumb approach, call:
HDCATE.set_bw(model)
Other details for using this package can be found on the detailed package documentation or the help files below:
help(HDCATE)
help(HDCATE.set_condition_var)
help(HDCATE.fit)
help(HDCATE.inference)
help(HDCATE.plot)
help(HDCATE.use_cross_fitting)
help(HDCATE.use_full_sample)
help(HDCATE.set_first_stage)
help(HDCATE.unset_first_stage)
help(HDCATE.set_bw)
Qingliang Fan, Yu-Chin Hsu, Robert P. Lieli & Yichong Zhang (2022) Estimation of Conditional Average Treatment Effects With High-Dimensional Data, Journal of Business & Economic Statistics, 40:1, 313-327, DOI: 10.1080/07350015.2020.1811102