
This guide provides a quick introduction to using mLLMCelltype for cell type annotation in single-cell RNA sequencing data. We’ll cover the basic workflow, input data requirements, and a simple example to get you started.
The mLLMCelltype workflow consists of these main steps:
First, load the mLLMCelltype package:
library(mLLMCelltype)
Before using mLLMCelltype, you need to set up API keys for the LLM providers you plan to use:
# Set API keys as environment variables
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key")  # For Claude models
Sys.setenv(OPENAI_API_KEY = "your-openai-api-key")        # For GPT models
Sys.setenv(GEMINI_API_KEY = "your-gemini-api-key")        # For Gemini models
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # For OpenRouter models
You can obtain API keys from:
- Anthropic: https://console.anthropic.com/
- OpenAI: https://platform.openai.com/
- Google (Gemini): https://ai.google.dev/
- OpenRouter: https://openrouter.ai/keys
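Before making any calls, it can be useful to confirm which keys are actually visible to your R session. The following is a minimal sketch using only base R; `check_api_keys()` is a hypothetical helper written for this guide, not a function provided by mLLMCelltype.

```r
# Hypothetical helper (not part of mLLMCelltype): report which provider
# keys are currently set in the environment.
check_api_keys <- function(vars = c("ANTHROPIC_API_KEY", "OPENAI_API_KEY",
                                    "GEMINI_API_KEY", "OPENROUTER_API_KEY")) {
  values <- Sys.getenv(vars, unset = "")
  # TRUE where the variable exists and is non-empty
  setNames(nzchar(values), vars)
}

# Returns a named logical vector, one entry per environment variable
check_api_keys()
```

Only the providers you plan to use need to report `TRUE`.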
Alternatively, you can provide API keys directly in function calls:
results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "claude-3-7-sonnet-20250219",
  api_key = "your-anthropic-api-key",  # Direct API key
  top_gene_count = 10
)
mLLMCelltype accepts marker gene data in several formats:
A data frame with the following columns:
- cluster: Cluster ID (must be 0-based)
- gene: Gene name/symbol
- avg_log2FC or similar metric: Log fold change
- p_val_adj or similar metric: Adjusted p-value
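A quick sanity check on these columns can catch formatting problems before any API call is made. The sketch below uses only base R; `validate_markers()` is a hypothetical helper, and the specific checks (required `cluster` and `gene` columns, smallest cluster ID equal to 0) are this guide's reading of the requirements above rather than the package's own validation logic.

```r
# Hypothetical helper (not part of mLLMCelltype): basic sanity checks on a
# marker data frame before it is passed to annotate_cell_types().
validate_markers <- function(df) {
  stopifnot(is.data.frame(df))
  missing_cols <- setdiff(c("cluster", "gene"), names(df))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  # Cluster IDs may be stored as factors or strings, so convert via character
  cluster_ids <- suppressWarnings(as.integer(as.character(df$cluster)))
  if (anyNA(cluster_ids)) {
    stop("Cluster IDs must be convertible to integers")
  }
  if (min(cluster_ids) != 0) {
    stop("Cluster IDs must be 0-based (smallest ID should be 0)")
  }
  invisible(TRUE)
}

toy_markers <- data.frame(
  cluster = c(0, 0, 1),
  gene = c("CD3D", "CD3E", "CD14")
)
validate_markers(toy_markers)  # passes silently
```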
Example:
# Example marker data frame
markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)
You can directly use the output from Seurat’s FindAllMarkers() function:
# Assuming you have a Seurat object named 'seurat_obj'
library(Seurat)
all_markers <- FindAllMarkers(seurat_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
A path to a CSV file containing marker gene data:
# Path to your CSV file
markers_file <- "path/to/markers.csv"
A list where each element contains marker genes for a cluster:
# Example marker list
markers_list <- list(
  "0" = c("CD3D", "CD3E", "CD2", "IL7R", "LTB"),
  "1" = c("CD14", "LYZ", "CST3", "MS4A7", "FCGR3A")
)
The annotate_cell_types() function has the following parameters:
| Parameter | Description | Default Value | 
|---|---|---|
| input | Marker gene data (data frame, list, or file path) | (required) | 
| tissue_name | Tissue name (e.g., “human PBMC”, “mouse brain”) | NULL | 
| model | LLM model to use | "gpt-4o" | 
| api_key | API key (if not set in environment) | NA | 
| top_gene_count | Number of top genes per cluster to use | 10 | 
| debug | Whether to print debugging information | FALSE | 
Note: If api_key is set to NA, the function
will return the generated prompt without making an API call, which is
useful for reviewing the prompt before sending it to the API.
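To make `top_gene_count` more concrete, the sketch below picks the top genes per cluster from a marker data frame. It assumes genes are ranked by descending `avg_log2FC`; that ranking is an assumption for illustration, and the package's internal selection may differ. `top_genes_per_cluster()` is a hypothetical helper, not part of mLLMCelltype.

```r
# Hypothetical helper: keep the n highest-ranked genes in each cluster,
# ranking by avg_log2FC (an assumed criterion, see the note above).
top_genes_per_cluster <- function(df, n = 10) {
  # Order by cluster, then by descending log fold change
  ordered <- df[order(df$cluster, -df$avg_log2FC), ]
  # Keep the first n genes within each cluster
  lapply(split(ordered$gene, ordered$cluster), head, n = n)
}

toy_markers <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD2", "CD3D", "CD3E", "CST3", "CD14", "LYZ"),
  avg_log2FC = c(2.1, 2.5, 2.3, 2.5, 3.1, 2.8),
  stringsAsFactors = FALSE
)

# A named list with one character vector of genes per cluster
top_genes_per_cluster(toy_markers, n = 2)
```

The result has the same shape as the list input format described earlier, so it also shows how a data frame and a marker list relate to each other.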
Here’s a simple example using a single LLM model for annotation:
# Example marker data
markers <- data.frame(
  cluster = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "IL7R", "LTB", "CD14", "LYZ", "CST3", "MS4A7", "FCGR3A"),
  avg_log2FC = c(2.5, 2.3, 2.1, 1.8, 1.7, 3.1, 2.8, 2.5, 2.2, 2.0),
  p_val_adj = c(0.001, 0.001, 0.002, 0.003, 0.005, 0.0001, 0.0002, 0.0005, 0.001, 0.002)
)
# Run annotation with a single model
results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "claude-3-7-sonnet-20250219",
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  top_gene_count = 10,
  debug = FALSE  # Set to TRUE for more detailed output
)
# Print results
print(results)
When using a single model like Claude, the output will be a character vector with one annotation per cluster:
> print(results)
[1] "0: T cells"   "1: Monocytes"
For more reliable annotations, you can use multiple models and create a consensus:
# Define models to use
models <- c(
  "claude-3-7-sonnet-20250219",  # Anthropic
  "gpt-4o",                      # OpenAI
  "gemini-1.5-pro"               # Google
)
# API keys for different providers
api_keys <- list(
  anthropic = Sys.getenv("ANTHROPIC_API_KEY"),
  openai = Sys.getenv("OPENAI_API_KEY"),
  gemini = Sys.getenv("GEMINI_API_KEY")
)
# Run annotation with multiple models
results <- list()
for (model in models) {
  provider <- get_provider(model)
  api_key <- api_keys[[provider]]
  results[[model]] <- annotate_cell_types(
    input = markers,
    tissue_name = "human PBMC",
    model = model,
    api_key = api_key,
    top_gene_count = 10
  )
}
# Create consensus
consensus_results <- interactive_consensus_annotation(
  input = markers,
  tissue_name = "human PBMC",
  models = models,  # Use all the models defined above
  api_keys = api_keys,
  controversy_threshold = 0.7,
  entropy_threshold = 1.0,
  consensus_check_model = "claude-3-7-sonnet-20250219"
)
# Print consensus results
print_consensus_summary(consensus_results)
The consensus results contain more detailed information:
> print_consensus_summary(consensus_results)
Consensus Summary:
-----------------
Total clusters: 2
Controversial clusters: 0
Consensus achieved for all clusters
Cluster 0:
  Final annotation: T cells
  Consensus proportion: 1.0
  Entropy: 0.0
  Model predictions:
    - claude-3-7-sonnet-20250219: T cells
    - gpt-4o: T cells
    - gemini-1.5-pro: T cells
Cluster 1:
  Final annotation: Monocytes
  Consensus proportion: 1.0
  Entropy: 0.0
  Model predictions:
    - claude-3-7-sonnet-20250219: Monocytes
    - gpt-4o: Monocytes
    - gemini-1.5-pro: Monocytes
To add the annotations to your Seurat object:
# Assuming you have a Seurat object named 'seurat_obj' and consensus results
library(Seurat)
# Add consensus annotations to Seurat object
seurat_obj$cell_type_consensus <- plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = as.character(0:(length(consensus_results$final_annotations)-1)),
  to = consensus_results$final_annotations
)
# Extract consensus metrics from the consensus results
# Note: These metrics are available in the consensus_results$initial_results$consensus_results
consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) {
  metrics <- consensus_results$initial_results$consensus_results[[cluster_id]]
  return(list(
    cluster = cluster_id,
    consensus_proportion = metrics$consensus_proportion,
    entropy = metrics$entropy
  ))
})
# Convert to data frame for easier handling
metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame))
# Add consensus proportion to Seurat object
# mapvalues() returns character values here, so convert back to numeric
seurat_obj$consensus_proportion <- as.numeric(plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = metrics_df$cluster,
  to = metrics_df$consensus_proportion
))
# Add entropy to Seurat object
seurat_obj$entropy <- as.numeric(plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = metrics_df$cluster,
  to = metrics_df$entropy
))
Here’s a simple visualization of the results using Seurat:
# ggtitle(), theme(), and element_text() come from ggplot2
library(ggplot2)
# Plot UMAP with cell type annotations
DimPlot(seurat_obj, group.by = "cell_type_consensus", label = TRUE, repel = TRUE) +
  ggtitle("Cell Type Annotations") +
  theme(plot.title = element_text(hjust = 0.5))
The output of annotate_cell_types() is a vector of cell type annotations, where each element corresponds to a cluster.
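If you want to look up annotations by cluster ID, the vector can be split into IDs and labels. This is a minimal base R sketch that assumes every element follows the `"cluster: label"` pattern shown in the example output earlier; actual model output may not always be this regular. `parse_annotations()` is a hypothetical helper, not part of mLLMCelltype.

```r
# Hypothetical helper: turn "cluster: label" strings into a named vector
# of labels, keyed by cluster ID.
parse_annotations <- function(x) {
  # Everything before the first colon is the cluster ID
  ids <- trimws(sub(":.*$", "", x))
  # Everything after the first colon is the label (later colons are kept)
  labels <- trimws(sub("^[^:]*:", "", x))
  setNames(labels, ids)
}

example_output <- c("0: T cells", "1: Monocytes")
annotations <- parse_annotations(example_output)
annotations[["1"]]
```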
The output of interactive_consensus_annotation() is a
list containing:
- final_annotations: Final consensus cell type annotations
- initial_results: Initial predictions from each model
- controversial_clusters: List of clusters that required discussion
- discussion_logs: Detailed logs of the discussion process
- session_id: Unique identifier for the annotation session
When using consensus annotation, two key metrics, consensus proportion and entropy, help evaluate the reliability of annotations.
Clusters with low consensus proportion or high entropy may require manual review.
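To build intuition for these two metrics, the sketch below computes them for one cluster from a vector of model predictions. It assumes consensus proportion is the share of models that agree with the most common label and that entropy is Shannon entropy in bits (log base 2); both definitions are assumptions for illustration, and the values reported by mLLMCelltype may be derived differently. `consensus_metrics()` is a hypothetical helper.

```r
# Hypothetical helper: agreement metrics for one cluster's predictions.
consensus_metrics <- function(predictions) {
  counts <- table(predictions)
  # Share of models voting for each distinct label
  p <- as.numeric(counts) / sum(counts)
  list(
    consensus_proportion = max(p),   # share of the most common label
    entropy = -sum(p * log2(p))      # Shannon entropy in bits
  )
}

# Full agreement: proportion is 1 and entropy is 0
consensus_metrics(c("T cells", "T cells", "T cells"))

# Two of three models agree: lower proportion, higher entropy
consensus_metrics(c("T cells", "T cells", "NK cells"))
```

Under these assumed definitions, two-of-three agreement gives a consensus proportion of about 0.67, which falls below the `controversy_threshold = 0.7` used in the consensus example earlier in this guide.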
If you don’t have access to paid API keys, you can use OpenRouter’s free models:
# Set OpenRouter API key
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")
# Use a free model
free_results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "meta-llama/llama-4-maverick:free",  # Note the :free suffix
  api_key = Sys.getenv("OPENROUTER_API_KEY"),
  top_gene_count = 10
)
# Print results
print(free_results)
Available free models include:
- meta-llama/llama-4-maverick:free - Meta Llama 4 Maverick (256K context)
- nvidia/llama-3.1-nemotron-ultra-253b-v1:free - NVIDIA Nemotron Ultra 253B
- deepseek/deepseek-chat-v3-0324:free - DeepSeek Chat v3
- microsoft/mai-ds-r1:free - Microsoft MAI-DS-R1
Free models don’t consume credits but may have limitations compared to paid models.
API Key Not Found:
Error: No auth credentials found
Solution: Ensure you’ve set the correct API key environment variable or provided it directly in the function call.
Rate Limiting:
Error: Rate limit exceeded
Solution: Wait a few minutes before trying again, or reduce the number of API calls by processing fewer clusters at once.
Invalid Model Name:
Error: Unsupported model: [model_name]
Solution: Check that you’re using a supported model name and that it’s spelled correctly.
Network Issues:
Error: Could not connect to API
Solution: Check your internet connection and try again. If the problem persists, the API service might be down.
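For transient failures such as rate limits or network errors, one option is to wrap the call in a retry loop with increasing delays. The following is a minimal base R sketch; `with_retry()` is a hypothetical helper written for this guide, not part of mLLMCelltype, and it retries on any error rather than inspecting the error message.

```r
# Hypothetical helper: call fn() up to max_attempts times, waiting longer
# after each failure (base_delay, 2 * base_delay, 4 * base_delay, ... seconds).
with_retry <- function(fn, max_attempts = 3, base_delay = 1) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt == max_attempts) {
      stop(result)  # re-signal the last error once attempts are exhausted
    }
    Sys.sleep(base_delay * 2^(attempt - 1))
  }
}

# Usage sketch (commented out because it requires a valid API key):
# results <- with_retry(function() {
#   annotate_cell_types(
#     input = markers,
#     tissue_name = "human PBMC",
#     model = "gpt-4o",
#     api_key = Sys.getenv("OPENAI_API_KEY")
#   )
# })
```

Retrying does not help with a missing API key or an unsupported model name, so those errors are better fixed at the source.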
Now that you understand the basics of mLLMCelltype, you can explore:
If you encounter any issues, please open an issue on our GitHub repository.