Perform robust power analysis for differential gene expression in scRNA-seq datasets

Run the complete power analysis pipeline by downsampling individuals and cells, performing differential expression analysis, and generating power plots.

power_analysis(
  SCE,
  range_downsampled_individuals = "placeholder",
  range_downsampled_cells = "placeholder",
  output_path = getwd(),
  sampleID = "donor_id",
  design = "placeholder",
  sexID = "sex",
  celltypeID = "cell_type",
  assay_name = "counts",
  coef = "male",
  fdr = 0.05,
  nom_pval = 0.05,
  Nperms = 20,
  y = NULL,
  region = "single_region",
  control = NULL,
  pval_adjust_method = "BH",
  rmv_zero_count_genes = TRUE,
  abs_effect_size_thresholds = "placeholder",
  upreg_effect_size_thresholds = "placeholder",
  downreg_effect_size_thresholds = "placeholder"
)

Arguments

SCE: A SingleCellExperiment object containing the input scRNA-seq data. You may also provide a path to an .R, .rds, or .qs file. If using a file, ensure the SCE object inside is named SCE.
range_downsampled_individuals: A numeric vector specifying the number of individuals to include at each downsampling level, in ascending order (e.g., c(10, 20, 30)). By default, 12 evenly spaced values are generated from 0 to the total number of samples, each rounded up to the nearest multiple of 5.
range_downsampled_cells: A numeric vector specifying the number of cells per individual to include at each downsampling level, in ascending order (e.g., c(20, 40, 60)). By default, 11 evenly spaced values are generated from 0 to the 90th percentile of per-individual cell counts, each rounded to the nearest multiple of 5.
output_path: A directory path where the DGE analysis outputs for the downsampled datasets and the power plots will be saved.
sampleID: Name of the column in the SCE metadata that identifies biological replicates (e.g., patient ID). This column is used for grouping in the pseudobulk approach.
design: A model formula specifying covariates for differential expression analysis. It should be of class formula (e.g., ~ sex + pmi + disease). This formula is used to fit a generalized linear model.
sexID: Name of the column in the SCE metadata that encodes the sex of individuals. Default is "sex".
celltypeID: Name of the column in the SCE metadata indicating cell type labels. This is used to identify cell-type-specific DEGs.
assay_name: Name of the assay in the SCE object to use for analysis. Default is "counts", which uses the count assay in each SCE.
coef: Character string indicating the level of the response variable (y) to test in the differential expression analysis. For case-control studies, this is typically the case level (e.g., "AD"). Used mainly for binary comparisons; not required for continuous outcomes.
fdr: Adjusted p-value (False Discovery Rate) threshold for selecting significantly differentially expressed genes (DEGs). Only genes with adjusted p-values below this value will be retained. Default is 0.05.
nom_pval: Nominal (unadjusted) p-value threshold for selecting DEGs. Used as an alternative to FDR when preferred. Only genes with p-values below this cutoff will be retained. Default is 0.05.
Nperms: Number of subsets (permutations) to generate at each downsampling level during power analysis. Each subset is analyzed independently to estimate variability. Default is 20.
y: Name of the column in the SCE metadata representing the response variable (e.g., "diagnosis", with levels such as case and control). If not specified, defaults to the last variable in the design formula. Accepts both categorical (logistic regression) and continuous (linear regression) variables.
region: Optional column in SCE metadata indicating the tissue or brain region. If present, differential expression is performed within each region separately. Defaults to "single_region" (i.e., no regional split).
control: Optional. Character string specifying the control level in the response variable (y) to compare against. Only required if y contains more than two levels. Ignored for binary or continuous outcomes.
pval_adjust_method: Method used to adjust p-values for multiple testing. Default is "BH" (Benjamini–Hochberg). See stats::p.adjust for available options.
rmv_zero_count_genes: Logical. Whether to remove genes with zero counts across all cells. Default is TRUE.
abs_effect_size_thresholds: Optional. Numeric vector of effect size (absolute logFC) thresholds to use for power analysis. If not provided, defaults to the 25th, 50th and 75th percentiles of the absolute logFCs. Must contain non-negative, increasing values.
upreg_effect_size_thresholds: Optional. Numeric vector of effect size thresholds to use for power analysis (for up-regulated DEGs). If not provided, defaults to the 25th, 50th and 75th percentiles of the positive logFCs. Must contain non-negative, increasing values.
downreg_effect_size_thresholds: Optional. Numeric vector of effect size thresholds to use for power analysis (for down-regulated DEGs). If not provided, defaults to the 25th, 50th and 75th percentiles of the negative logFCs. Must contain negative (or zero), increasing values.

Value

Saves all plots and DGE analysis outputs in the appropriate directories under output_path.
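
The default downsampling ranges described for range_downsampled_individuals and range_downsampled_cells can be approximated in base R. The following is a minimal sketch of the rule as worded in the argument descriptions, not the package's internal code. The simulated cell counts, the cap at the total number of individuals, and the removal of zero or duplicated levels are all assumptions made for illustration.

```r
# Hypothetical input: simulated per-individual cell counts for 48 donors.
set.seed(1)
cells_per_individual <- rpois(48, lambda = 300)
n_individuals <- length(cells_per_individual)

# Individuals: 12 evenly spaced values from 0 to the number of samples,
# each rounded up to a multiple of 5.
individuals_range <- ceiling(seq(0, n_individuals, length.out = 12) / 5) * 5

# Assumption: rounding up can overshoot the cohort size, so cap it here.
individuals_range <- pmin(individuals_range, n_individuals)

# Cells: 11 evenly spaced values from 0 to the 90th percentile of
# per-individual cell counts, each rounded to the nearest multiple of 5.
cells_upper <- unname(quantile(cells_per_individual, probs = 0.9))
cells_range <- round(seq(0, cells_upper, length.out = 11) / 5) * 5

# Assumption: drop zero and duplicated levels, since a level with no
# individuals or no cells cannot be analysed.
individuals_range <- unique(individuals_range[individuals_range > 0])
cells_range <- unique(cells_range[cells_range > 0])

print(individuals_range)
print(cells_range)
```

Passing explicit vectors for both arguments avoids relying on these defaults when specific downsampling levels are needed.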

Examples

if (FALSE) { # \dontrun{
# Too slow to run with check()
# 1. Prepare SCE
micro_tsai <- system.file("extdata", "Tsai_Micro.qs", package="poweranalysis")
SCE_tsai <- qs::qread(micro_tsai)

# 2. Run Power Analysis
PA_tsai <- poweranalysis::power_analysis(
    SCE_tsai,
    sampleID = "sample_id",
    celltypeID = "cluster_celltype",
    design = ~ sex,
    coef = "M",
    output_path = tempdir()
)
PA_tsai
} # }
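
A variation on the example above passes explicit downsampling levels and effect size thresholds instead of relying on the defaults. Because a full run is slow, it can help to check the ordering and sign constraints stated in the argument descriptions beforehand. This is a sketch under assumptions: the numeric values and the is_increasing helper are hypothetical and not part of the package, and the call itself is guarded so that it does not run.

```r
# Hypothetical downsampling levels and effect size thresholds.
individuals <- c(10, 20, 30)
cells <- c(20, 40, 60)
abs_thresholds <- c(0.1, 0.5, 1)
up_thresholds <- c(0.1, 0.5, 1)
down_thresholds <- c(-1, -0.5, -0.1)

# Hypothetical helper (not part of the package): TRUE if strictly increasing.
is_increasing <- function(x) all(diff(x) > 0)

# Check the constraints stated in the argument descriptions.
stopifnot(
  is_increasing(individuals),
  is_increasing(cells),
  is_increasing(abs_thresholds), all(abs_thresholds >= 0),
  is_increasing(up_thresholds), all(up_thresholds >= 0),
  is_increasing(down_thresholds), all(down_thresholds <= 0)
)

if (FALSE) {
  # Too slow to run routinely; same data and columns as the example above.
  micro_tsai <- system.file("extdata", "Tsai_Micro.qs", package = "poweranalysis")
  SCE_tsai <- qs::qread(micro_tsai)
  PA_custom <- poweranalysis::power_analysis(
    SCE_tsai,
    range_downsampled_individuals = individuals,
    range_downsampled_cells = cells,
    sampleID = "sample_id",
    celltypeID = "cluster_celltype",
    design = ~ sex,
    coef = "M",
    abs_effect_size_thresholds = abs_thresholds,
    upreg_effect_size_thresholds = up_thresholds,
    downreg_effect_size_thresholds = down_thresholds,
    output_path = tempdir()
  )
}
```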