R/DGE_analysis.R
DGE_analysis.Rd
Perform differential expression analysis across cell types based on single cell data with Pseudobulk approach. Returns the results of this differential expression analysis
DGE_analysis(
SCE,
design,
pseudobulk_ID,
celltype_ID,
y = NULL,
region = "single_region",
coef = NULL,
control = NULL,
pval_adjust_method = "BH",
adj_pval = 0.05,
folder = "sc.cell.type.de.graphs/",
rmv_zero_count_genes = TRUE,
verbose = F
)
SingleCellExperiment object, a specialised S4 class for storing data from single-cell experiments. A location of an R, rds or qs file to be loaded can also be passed. If using R objects make sure SCE object saved with the name SCE
Design formula of class type formula
. Equation used to fit the model- data for the generalised linear model e.g. expression ~ sex + pmi + disease.
Column name in the SCE object to perform pseudobulk on, usually the patient identifier. This column is used for grouping in the pseudobulk approach
Column name in the SCE object for the cell type variable. This is used to identify celltype specific DEGs. If there is only one cell type in the analysis, the analysis will be automatically updated accordingly.
Column name in the SCE object for the return variable e.g. "diagnosis" - Case or disease. Default is the last variable in the design formula. y can be discrete (logisitc regression) or continuous (linear regression)
Column name in the SCE object for the study region. If there are multiple regions in the study (for example two brain regions). Pseudobulk values can be derived separately. Default is "single_region" which will not split by region.
Character specifying which level to investigate for the differential expression analysis e.g. in a case/control study use "case" if case is the identifier in the y column to get positive fold changes to relate to case samples. leave as default value for continuous y. Default is NULL.
Character specifying which control level for the differential expression analysis e.g. in a case/control/other study use "control" in the y column to compare against. NOTE only need to specify if more than two groups in y, leave as default value for two groups or continuous y. Default is NULL.
Adjustment method for the p-value in the differential expression analysis. Default is benjamini hochberg "BH". See stats::p.adjust for available options
Adjusted p-value cut-off for the differential expression analysis, 0-1 range
Folder where the graphs from the differential expression analysis are saved. Default will create a folder in the current working directory "sc.cell.type.de.graphs". False will skip plotting.
Whether genes with no count values in any cell should be removed. Default is TRUE
Logical indicating if extra information about the differential expression analysis should be printed
A list containing: celltype_DEGs: list with the differentially expressed genes (DEGs) for each cell type celltype_all_genes: list with all genes along with their differential expression scores for each cell type celltype_counts: vector with the counts of cells after QC in each cell type
if (FALSE) { # \dontrun{
# To use the function input the formula of the comparison you want to make along with the names
# of the pseudobulk ID and celltype ID
# If you want to compare disease and control across cell types, taking into account sex use:
# formula = ~ sex + disease. Firstly load your SCE object (can otherwise specify location)
library(qs)
library(SingleCellExperiment)
SCE <- qs::qread("sce.qs")
DGE_analysis.return_incl_sex<- DGE_analysis(SCE_small,
design= ~ sex + pathological_diagnosis_original,
pseudobulk_ID="sample_id", celltype_ID="allan_celltype",coef="AD")
# If you only want to also account for other variables as well as sex, such as
# postmortem interval (PMI):
DGE_analysis.return_sex_pmi<- DGE_analysis(SCE_small,
design= ~ sex + PMI + pathological_diagnosis_original,
pseudobulk_ID="sample_id", celltype_ID="allan_celltype",coef="AD")
# If you only want to compare disease and control across cell types w/o any other constraints:
DGE_analysis.return<- DGE_analysis(SCE_small,design= ~ pathological_diagnosis_original,
pseudobulk_ID="sample_id", celltype_ID="allan_celltype",coef="AD")
# A good way to validate the DEG analysis approach is to run a sex comparison
# You would expect genes on the sex chromosomes to be DEGs, this can be done:
DGE_analysis.sex.return<- DGE_analysis(SCE,design= ~ sex,
pseudobulk_ID="sample_id", celltype_ID="allan_celltype",coef="M")
#get gene names of chromosomal DEGs
library('biomaRt')
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- unique(as.character(our_degs$name))
gene_IDs <- getBM(filters= "ensembl_gene_id",
attributes= c("ensembl_gene_id","chromosome_name","hgnc_symbol"),
values = genes, mart= mart,useCache = FALSE)
gene_IDs <- as.data.table(gene_IDs)
setnames(gene_IDs,"ensembl_gene_id","name")
DEGs <- data.table::rbindlist(sc.cell.type.de.return$celltype_DEGs,idcol="celltype")
setkey(DEGs,name)
#append gene names
DEGs[, gene_name := gene_IDs[DEGs, on=.(name), x.hgnc_symbol]]
DEGs[, chromosome_name := gene_IDs[DEGs, on=.(name), x.chromosome_name]]
#Xist should be there, key indicator of correct DEG analysis
DEGs[gene_name=="XIST",c("celltype","adj_pval","lfc")]
#Present in all cell types
# There should also be high number of other DEGs on sex chromosomes, check:
our_degs[chromosome_name %in% c("X","Y"),
.N,by=.(celltype,chromosome_name)]
} # }