staRburst makes it trivial to scale your parallel R code from your laptop to 100+ AWS workers. This vignette walks through setup and common usage patterns.
Before using staRburst, you need to configure AWS resources. This only needs to be done once.
This will:

- Validate your AWS credentials
- Create an S3 bucket for data transfer
- Create an ECR repository for Docker images
- Set up ECS cluster and VPC resources
- Check Fargate quotas and offer to request increases
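A minimal setup session might look like the following. The entry-point name `starburst_setup()` is an assumption for illustration; check the package reference for the actual setup function.

```r
library(starburst)

# One-time AWS resource setup (hypothetical entry point; see package docs).
# Walks through credential validation, S3/ECR creation, and quota checks.
starburst_setup()
```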
The simplest way to use staRburst is with the furrr package:
```r
library(furrr)
library(starburst)

# Define your work
expensive_simulation <- function(i) {
  # Some computation that takes a few minutes
  results <- replicate(1000, {
    x <- rnorm(10000)
    mean(x^2)
  })
  mean(results)
}

# Local execution (single core)
plan(sequential)
system.time({
  results_local <- future_map(1:100, expensive_simulation)
})
#> ~16 minutes on a typical laptop

# Cloud execution (50 workers)
plan(future_starburst, workers = 50)
system.time({
  results_cloud <- future_map(1:100, expensive_simulation)
})
#> ~2 minutes (including 45s startup)
#> Cost: ~$0.85

# Results are identical
identical(results_local, results_cloud)
#> [1] TRUE
```

Example: Monte Carlo portfolio simulation

```r
library(starburst)
library(furrr)

# Simulate portfolio returns
simulate_portfolio <- function(seed) {
  set.seed(seed)
  # Random walk for 252 trading days
  returns <- rnorm(252, mean = 0.0003, sd = 0.02)
  prices <- cumprod(1 + returns)
  list(
    final_value = prices[252],
    max_drawdown = max((cummax(prices) - prices) / cummax(prices)),
    sharpe_ratio = mean(returns) / sd(returns) * sqrt(252)
  )
}

# Run 10,000 simulations on 100 workers
plan(future_starburst, workers = 100)
results <- future_map(1:10000, simulate_portfolio,
                      .options = furrr_options(seed = TRUE))

# Analyze results
final_values <- sapply(results, `[[`, "final_value")
hist(final_values, breaks = 50, main = "Distribution of Portfolio Final Values")

# 95% confidence interval
quantile(final_values, c(0.025, 0.975))
```

Performance:

- Local (single core): ~4 hours
- Cloud (100 workers): ~3 minutes
- Cost: ~$1.80
Example: bootstrapped regression

```r
library(starburst)
library(furrr)

# Your data
data <- read.csv("my_data.csv")

# Bootstrap function
bootstrap_regression <- function(i, data) {
  # Resample with replacement
  boot_indices <- sample(nrow(data), replace = TRUE)
  boot_data <- data[boot_indices, ]
  # Fit model
  model <- lm(y ~ x1 + x2 + x3, data = boot_data)
  # Return coefficients
  coef(model)
}

# Run 10,000 bootstrap samples
plan(future_starburst, workers = 50)
boot_results <- future_map(1:10000, bootstrap_regression, data = data,
                           .options = furrr_options(seed = TRUE))

# Convert to matrix
boot_coefs <- do.call(rbind, boot_results)

# 95% confidence intervals for each coefficient
apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))
```

Example: genomics pipeline (data already in S3)

```r
library(starburst)
library(furrr)

# Process one sample
process_sample <- function(sample_id) {
  # Read from S3 (data already in cloud)
  fastq_path <- sprintf("s3://my-genomics-data/samples/%s.fastq", sample_id)
  data <- read_fastq(fastq_path)
  # Align reads
  aligned <- align_reads(data, reference = "hg38")
  # Call variants
  variants <- call_variants(aligned)
  # Return summary
  list(
    sample_id = sample_id,
    num_variants = nrow(variants),
    variants = variants
  )
}

# Process 1000 samples on 100 workers
# NB: base list.files() cannot enumerate S3 paths; use an S3 client
# (e.g. aws.s3::get_bucket_df()) to list objects in the bucket
sample_ids <- list.files("s3://my-genomics-data/samples/", pattern = "\\.fastq$")
plan(future_starburst, workers = 100)
results <- future_map(sample_ids, process_sample, .progress = TRUE)

# Combine results
all_variants <- do.call(rbind, lapply(results, `[[`, "variants"))
```

Performance:

- Local (sequential): ~208 hours (8.7 days)
- Cloud (100 workers): ~2 hours
- Cost: ~$47
If your data is already in S3, workers can read it directly:
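As a sketch of the direct-read pattern: each worker fetches only the object it needs. This assumes the `aws.s3` package is installed in the worker image (`s3readRDS()` belongs to `aws.s3`, not staRburst), and the bucket/object names are placeholders.

```r
library(furrr)
library(starburst)
library(aws.s3)  # assumed available in the worker image

plan(future_starburst, workers = 50)
results <- future_map(1:50, function(i) {
  # Each worker reads only its own chunk, directly from S3 --
  # nothing is shipped from your laptop
  chunk <- s3readRDS(
    object = sprintf("chunks/chunk_%03d.rds", i),  # placeholder layout
    bucket = "my-bucket"
  )
  summary(chunk)
})
```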
For smaller datasets, you can pass data as arguments:
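For instance, a small lookup table can be passed through `future_map()`'s `...`, and furrr ships it with each task. `score_row()` here is a stand-in for your own per-task function.

```r
library(furrr)
library(starburst)

lookup <- read.csv("lookup_table.csv")  # small enough to ship with each task

plan(future_starburst, workers = 20)
results <- future_map(1:100, function(i, tbl) {
  # `tbl` travels with the task payload to each worker
  score_row(tbl, i)  # stand-in for your own computation
}, tbl = lookup)
```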
For very large objects, pre-upload to S3:

```r
# Upload once
large_data <- read.csv("huge_file.csv")
s3_path <- starburst_upload(large_data, "s3://my-bucket/large_data.rds")

# Workers read from S3
plan(future_starburst, workers = 100)
results <- future_map(1:1000, function(i) {
  # Read from S3 inside worker
  data <- readRDS(s3_path)
  process(data, i)
})
```

```r
# Set maximum cost per job
starburst_config(
  max_cost_per_job = 10,     # Don't start jobs that would cost >$10
  cost_alert_threshold = 5   # Warn when approaching $5
)

# Now jobs exceeding the limit will error before starting
plan(future_starburst, workers = 1000)  # Would cost ~$35/hour
#> Error: Estimated cost ($35/hr) exceeds limit ($10/hr)
```

If you request more workers than your quota allows, staRburst automatically uses wave-based execution:
```r
# Quota allows 25 workers, but you request 100
plan(future_starburst, workers = 100, cpu = 4)
#> ⚠ Requested: 100 workers (400 vCPUs)
#> ⚠ Current quota: 100 vCPUs (allows 25 workers max)
#>
#> 📋 Execution plan:
#>   • Running in 4 waves of 25 workers each
#>
#> 💡 Request quota increase to 500 vCPUs? [y/n]: y
#>
#> ✅ Quota increase requested
#> ⚡ Starting wave 1 (25 workers)...

results <- future_map(1:1000, expensive_function)
#> ⚡ Wave 1: 100% complete (250 tasks)
#> ⚡ Wave 2: 100% complete (500 tasks)
#> ⚡ Wave 3: 100% complete (750 tasks)
#> ⚡ Wave 4: 100% complete (1000 tasks)
```

Environment mismatch: Packages not found on workers. Make sure every package your code calls is installed in the worker Docker image (the ECR image created during setup).
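One quick diagnostic, using only furrr and base R, is to ask a single worker what it has installed and diff that against what your code needs (the package names below are examples):

```r
library(furrr)
library(starburst)

plan(future_starburst, workers = 1)

# Ask one worker for its installed packages
worker_pkgs <- future_map(1, function(i) rownames(installed.packages()))[[1]]

# Packages your code needs but the worker image lacks
needed <- c("dplyr", "data.table")  # example: your code's dependencies
setdiff(needed, worker_pkgs)
```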
Task failures: Some tasks failing

```r
# Check logs
starburst_logs(task_id = "failed-task-id")

# Often due to memory limits - increase worker memory
plan(future_starburst, workers = 50, memory = "16GB")  # Default is 8GB
```

Slow data transfer: Large objects taking too long. Upload once with `starburst_upload()` and have workers read from S3, as shown above.
✅ Good: Each task takes >5 minutes
❌ Bad: Each task takes <1 minute
Instead of:
Do:
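As a sketch of the difference: rather than one sub-second task per element, group elements into chunks so each task runs for minutes. `slow_step()` is a placeholder for your own computation.

```r
library(furrr)
library(starburst)

plan(future_starburst, workers = 50)

# Instead of: 100,000 tiny tasks, where scheduling overhead dominates
# results <- future_map(1:100000, slow_step)

# Do: 100 chunks of ~1,000 elements, so each task does real work
chunks <- split(1:100000, cut(1:100000, 100, labels = FALSE))
results <- future_map(chunks, function(idx) lapply(idx, slow_step))
```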
Don't:

```r
big_data <- read.csv("10GB_file.csv")  # Shipped to workers for every task
results <- future_map(1:1000, function(i) process(big_data, i))
```

Do:
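The corresponding pattern uses `starburst_upload()` from earlier: upload the object once, then let each worker fetch it from S3. `process()` is a placeholder for your own function.

```r
library(furrr)
library(starburst)

# Upload once, outside the mapped function
big_data <- read.csv("10GB_file.csv")
s3_path <- starburst_upload(big_data, "s3://my-bucket/big_data.rds")

plan(future_starburst, workers = 100)
results <- future_map(1:1000, function(i) {
  data <- readRDS(s3_path)  # each worker fetches from S3, not your laptop
  process(data, i)          # placeholder for your own work
})
```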