sae.projection: A Model-Assisted Projection Estimator for Combining Independent Surveys

Ridson Al Farizal P (ridsonap@bps.go.id)

2025-07-06

Introduction

The ma_projection() function implements a model-assisted projection estimator for combining information from two independent surveys. This method is especially useful in survey sampling scenarios where:

This vignette illustrates how to use ma_projection() for domain-level estimation using various supervised learning models, including machine learning techniques via the parsnip interface.

Method Overview

The approach follows the work of Kim & Rao (2012), where a working model is trained on Survey 2 to predict the outcome variable. Predictions are made for the auxiliary-only Survey 1 data. These predictions are then aggregated by domain to generate small area estimates.

Required Packages

library(sae.projection)
library(dplyr)
library(tidymodels)
library(bonsai)  # for modern tree-based models

Example: Income Estimation Using Linear Regression

# Filter non-missing values for income
svy22_income <- df_svy22 %>% filter(!is.na(income))
svy23_income <- df_svy23 %>% filter(!is.na(income))

# Fit projection model
lm_result <- ma_projection(
  income ~ age + sex + edu + disability,
  cluster_ids = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = linear_reg(),
  data_model = svy22_income,
  data_proj = svy23_income,
  nest = TRUE
)

# View results
head(lm_result$df_result)

Example: Binary Outcome Using Logistic Regression

# Filter youth population for NEET classification
svy22_neet <- df_svy22 %>% filter(between(age, 15, 24))
svy23_neet <- df_svy23 %>% filter(between(age, 15, 24))

# Fit logistic regression model
lr_result <- ma_projection(
  formula = neet ~ sex + edu + disability,
  cluster_ids = ~ PSU,
  weight = ~ WEIGHT,
  strata = ~ STRATA,
  domain = ~ PROV + REGENCY,
  working_model = logistic_reg(),
  data_model = svy22_neet,
  data_proj = svy23_neet,
  nest = TRUE
)

# View results
head(lr_result$df_result)

Example: LightGBM with Hyperparameter Tuning

# Define LightGBM model with tuning
lgbm_model <- boost_tree(
  mtry = tune(), trees = tune(), min_n = tune(),
  tree_depth = tune(), learn_rate = tune(),
  engine = "lightgbm"
)

# Fit with cross-validation
lgbm_result <- ma_projection(
  formula = neet ~ sex + edu + disability,
  cluster_ids = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = lgbm_model,
  data_model = svy22_neet,
  data_proj = svy23_neet,
  cv_folds = 3,
  tuning_grid = 5,
  nest = TRUE
)

# View results
head(lgbm_result$df_result)

Supported Models

ma_projection() supports many working models using the parsnip interface, including:

References

Kim, J. K., & Rao, J. N. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85–100. doi:10.1093/biomet/asr063

Conclusion

ma_projection() provides a flexible and robust way to combine survey data using modern modeling tools. It supports a wide range of use cases including socioeconomic indicators, health estimates, and more.