## ===================== 🦠🌋🦠 Welcome to MAGMA.Celltyping 🦠🌋🦠 =====================
## This package uses MAGMA:
## https://ctg.cncr.nl/software/magma
## 
## To cite MAGMA.Celltyping, please use:
## * Skene, N.G., Bryois, J., Bakken, T.E. et al. Genetic identification of
##      brain cell types underlying schizophrenia. Nat Genet 50, 825-833 (2018).
##      https://doi.org/10.1038/s41588-018-0129-5
## * de Leeuw CA, Mooij JM, Heskes T, Posthuma D (2015) MAGMA: Generalized
##      Gene-Set Analysis of GWAS Data. PLOS Computational Biology 11(4): e1004219.
##      https://doi.org/10.1371/journal.pcbi.1004219
## 
## Please report any bugs or feature requests by filling out an Issues template:
##      https://github.com/neurogenomics/MAGMA_Celltyping/issues
## ===================== 🦠🌋🦠 =========================== 🦠🌋🦠 =====================
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Intro

MAGMA.Celltyping is a software package that facilitates conducting cell-type-specific enrichment tests on GWAS summary statistics.

The following vignette shows the step-by-step version of the MAGMA.Celltyping pipeline. For the newer streamlined version of the pipeline (that wraps many of these functions), please see the Getting Started vignette.

Prepare data

Import GWAS

  • We need to have a GWAS summary statistics file to analyse as input.
  • As an example, you can download UK Biobank summary statistics for ‘fluid_intelligence’ using get_example_gwas().
gwas_sumstats_path <- MAGMA.Celltyping::get_example_gwas(
  trait = "fluid_intelligence",
  munged = FALSE)

Munge GWAS

Our lab have created a robust Bioconductor package for formatting multiple types of summary statistics files: MungeSumstats (please cite the associated publication if you use this package):

Murphy, Schilder, & Skene, MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics, Bioinformatics, Volume 37, Issue 23, 1 December 2021, Pages 4593–4596, https://doi.org/10.1093/bioinformatics/btab665

  • We will use MungeSumstats here to reformat the summary statistics for Fluid Intelligence. The minimum info needed after munging is:
  • SNP, CHR, BP as first three columns.
  • It has at least one of these columns: (“Z”,“OR”,“BETA”,“LOG_ODDS”,“SIGNED_SUMSTAT”)

Genome builds:

  • The UK Biobank data from Ben Neale uses “GRCh37” so we will specify this.
  • However, if you don’t know which genome build you GWAS sum stats are in, MungeSumstats will infer this information by default (simply set ref_genome=NULL).
path_formatted <- MungeSumstats::format_sumstats(path=gwas_sumstats_path,
                                                 save_path = tempfile(fileext = ".formatted.tsv.gz"),
                                                 ref_genome ="GRCh37")

For the sake of this example, we also provide a pre-munged version of the above file. If you use this, you can skip the previous two steps.

path_formatted_intelligence <- MAGMA.Celltyping::get_example_gwas(
    trait = "fluid_intelligence", 
    munged = TRUE)

Map SNPs to Genes

Next, we must convert the GWAS summary statistics file to a gene-level signature so that we can compare it to a gene-level transcriptomic cell-type reference.

Note you can input the genome build of your summary statistics for this step or it can be inferred if left NULL:

genesOutPath_intelligence <- MAGMA.Celltyping::map_snps_to_genes(
    path_formatted = gwas_sumstats_path,
    genome_build = "GRCh37")

CellTypeDataset

Import CTD

The second step in MAGMA.Celltyping is to prepare your Cell Type Dataset (CTD) reference, from which we will get gene signatures for each cell-type in the reference.

The ewceData package comes with a CTD which we use as an example.

If you want to import your own single cell RNA-seq dataset, then this needs converting into CTD format; please see the EWCE vingette for explanation of how to do this.

ctd <- ewceData::ctd()

Note that the cell type dataset loaded in the code above is the Karolinksa cortex/hippocampus data only. For the full Karolinska dataset with hypothalamus and midbrain instead use the following:

ctd <- MAGMA.Celltyping::get_ctd("ctd_allKI")

You can find a list of other datasets provided by this package using ?MAGMA.Celltyping::get_ctd:

ctd <- get_ctd("ctd_Tasic")
ctd <- get_ctd("ctd_DivSeq")
ctd <- get_ctd("ctd_AIBS")
ctd <- get_ctd("ctd_DRONC_human")
ctd <- get_ctd("ctd_DRONC_mouse")
ctd <- get_ctd("ctd_BlueLake2018_FrontalCortexOnly")
ctd <- get_ctd("ctd_BlueLake2018_VisualCortexOnly")
ctd <- get_ctd("ctd_Saunders")
ctd <- get_ctd("ctd_Zeisel2018")

Prepare quantiles

Preparing quantiles in the CTD is now something that is automatically handled by by MAGMA.Celltyping so the user doesn’t need to do this themselves as a pre-step anymore.

However, we still export the function prepare_quantile_groups in case users want to have more control.

prepare_quantile_groups calculates the quantile groups for each celltype within the single-cell dataset. If your single-cell dataset is not from mouse, then change the ctd_species argument.

ctd <- MAGMA.Celltyping::prepare_quantile_groups(ctd = ctd,
                                                 ctd_species="mouse")
# Examine how the quantile groups look
print(ctd[[1]]$specificity_quantiles[c("Gfap","Dlg4","Aif1"),])
print(table(ctd[[1]]$specificity_quantiles[,1]))

Run the main cell type association analysis

The analyses can be run in either linear or top10% enrichment modes. Each have their own advantages/disadvantages.

  • Linear mode: Performs a linear regression analysis using continuous gene-association scores from the GWAS MAGMA results and continuous specificity scores from each cell-type in the CTD reference. This mode takes advantage of gene-level weights in order to get more accurate association estimates. However, this mode also tends to give more significant results in general, as it may be capturing signatures from multiple cell-types at once.
  • Top 10% mode: Performs gene-set enrichment analysis (GSEA) using only the top 10% most cell-type-specific genes for each cell-type (i.e. genes in the top specificity decile). Unlike linear mode, top 10% mode discards any gene-level weight data during testing and instead treats genes as categorical data (in the gene list or not in the gene list). While you do lose the nuance of the gene-level weights, you potential capture less cell-type signatures in a given test (due to only using a handful of top genes), which can make interpretation easier.

Let’s start with linear:

ctAssocsLinear <- MAGMA.Celltyping::calculate_celltype_associations(
  ctd = ctd,
  gwas_sumstats_path = path_formatted_intelligence, 
  ctd_species = "mouse")

FigsLinear <- MAGMA.Celltyping::plot_celltype_associations(
  ctAssocs = ctAssocsLinear,
  ctd = ctd)

Now let’s add the top 10% mode

ctAssocsTop <- MAGMA.Celltyping::calculate_celltype_associations(
  ctd = ctd,
  gwas_sumstats_path = path_formatted_intelligence, 
  EnrichmentMode = "Top 10%")

FigsTopDecile <- MAGMA.Celltyping::plot_celltype_associations(
  ctAssocs = ctAssocsTop,
  ctd = ctd)

Then plot linear together with the top 10% mode.

ctAssocMerged <- MAGMA.Celltyping::merge_magma_results(
  ctAssoc1 = ctAssocsLinear,
  ctAssoc2 = ctAssocsTop)

FigsMerged <- MAGMA.Celltyping::plot_celltype_associations(
  ctAssocs = ctAssocMerged,
  ctd = ctd)

Conditional analysis (linear mode)

Sometimes a single cell-type can dominate the signature in a given GWAS (e.g. pyramidal cells in Schizophrenia). This makes it difficult to disentangle whether other cell-types are truly associated with the GWAS trait, or whether the are simply enriched because they have an overlapping signature with the top cell-type. Conditional mode allows you to control for one (or several) cell-type(s) and see whether other cell-types still remain significantly enriched.

By default, it is assumed that you want to run the linear enrichment analysis. There are two modes for conditional analyses, you can either control for the top N cell types from the baseline analysis (in which case, set controlTopNcells) or control for specific specified cell types (in which case, set controlledCTs).

ctCondAssocs <- MAGMA.Celltyping::calculate_conditional_celltype_associations(
  ctd = ctd,
  gwas_sumstats_path = path_formatted_intelligence, 
  analysis_name = "Conditional",
  controlTopNcells= 2)

if(!is.null(ctCondAssocs)){ 
    MAGMA.Celltyping::plot_celltype_associations(ctAssocs = ctCondAssocs,
                                                 ctd = ctd)
}

Let’s try as an alternative to control for expression of 3 neuronal subtypes at the same time:

ctCondAssocs <- MAGMA.Celltyping::calculate_conditional_celltype_associations(
  ctd = ctd,
  gwas_sumstats_path = path_formatted_intelligence, 
  analysis_name = "Conditional",
  controlledCTs = c("pyramidal CA1","pyramidal SS","interneurons"),
  controlledAnnotLevel = 1)

MAGMA.Celltyping::plot_celltype_associations(ctAssocs = ctCondAssocs,
                                             ctd=ctd)

Note that Periventricular Microglia (PVM) go from totally non-significant to significant once the neurons are controlled for. Test if this change is significant as follows:

magma1 = ctCondAssocs[[2]]$results[ctCondAssocs[[2]]$results$CONTROL=="BASELINE",]
magma2 = ctCondAssocs[[2]]$results[ctCondAssocs[[2]]$results$CONTROL=="interneurons,pyramidal_CA1,pyramidal_SS",]

resCompared <- MAGMA.Celltyping:::compare_trait_enrichments(magma1=magma1,
                                                            magma2=magma2,
                                                            annotLevel=2,
                                                            ctd=ctd)
resCompared[1:3,]

Using this approach we can see that the increased enrichment is microglia in the controlled analysis is almost significantly increased relative to the baseline analysis.

Conditional analyses (top 10% mode)

Conditional analyses can also be performed with top 10% mode (although the conditioning is done in linear mode).

ctCondAssocsTopTen = MAGMA.Celltyping::calculate_conditional_celltype_associations(
  ctd = ctd,
  gwas_sumstats_path = path_formatted_intelligence, 
  analysis_name = "Conditional",
  controlledCTs=c("pyramidal CA1","pyramidal SS","interneurons"),
  controlledAnnotLevel=1,
  EnrichmentMode = "Top 10%")

MAGMA.Celltyping::plot_celltype_associations(ctAssocs = ctCondAssocsTopTen,
                                             ctd = ctd)

Controlling for a second GWAS

We now want to test enrichments that remain in a GWAS after we control for a second GWAS. So let’s download a second GWAS sumstats file and prepare it for analysis.

20016.assoc.tsv is the sumstats file for ‘Fluid Intelligence Score’ from the UK Biobank.

So let’s subtract genes associated with prospective memory from those involved in fluid intelligence.

Download and prepare the ‘Prospective memory’ GWAS summary statistics

#### Download pre-munged GWAS summary stats ####
gwas_sumstats_path_memory <- MAGMA.Celltyping::get_example_gwas(
    trait = "prospective_memory",
    munged = FALSE)
#### Format summary stats ####
## and infer the genome build from the data whilst doing so (see output messages)
path_formatted_memory <- MungeSumstats::format_sumstats(
    path = gwas_sumstats_path_memory)

#### Can also skip ahead by importing pre-munged summary stats ####
# path_formatted_memory <- MAGMA.Celltyping::get_example_gwas(
#     trait = "prospective_memory",
#     munged = TRUE)

#### Map SNPs to genes ####
genesOutPath_memory <- MAGMA.Celltyping::map_snps_to_genes(
    path_formatted = path_formatted_memory,
    genome_build = "GRCH37")

Check which cell types this GWAS is associated with at baseline

ctAssocsLinear_memory <- MAGMA.Celltyping::calculate_celltype_associations(
  ctd = ctd,
  gwas_sumstats_path = path_formatted_memory, 
  ctd_species = "mouse")

ctAssocsLinear_intelligence <- MAGMA.Celltyping::calculate_celltype_associations(
  ctd = ctd,
  gwas_sumstats_path = path_formatted_intelligence, 
  ctd_species = "mouse")

MAGMA.Celltyping::plot_celltype_associations(ctAssocs = ctAssocsLinear_memory,
                                             ctd = ctd)

Compare enrichments in the two GWAS using a tile plot

ctAssocMerged_MemInt <- MAGMA.Celltyping::merge_magma_results(
  ctAssoc1 = ctAssocsLinear_memory,
  ctAssoc2 = ctAssocsLinear_intelligence)

FigsMerged_MemInt <- lapply(seq_len(length(ctd)), 
                            function(lvl){
    MAGMA.Celltyping::magma_tileplot(ctd=ctd,
                          results=ctAssocMerged_MemInt[[lvl]]$results,
                          annotLevel=lvl,
                          fileTag=paste0("Merged_MemInt_lvl",lvl))
}) 

Check which cell types ‘Fluid Intelligence’ is associated with after controlling for ‘Prospective memory’

# Set paths for GWAS sum stats + .genes.out file (with the z-scores)  
ctAssocsLinear <- MAGMA.Celltyping::calculate_celltype_associations(
  ctd = ctd,
  ctd_species = "mouse", 
  gwas_sumstats_path = path_formatted_intelligence, 
  genesOutCOND = genesOutPath_memory,
  analysis_name = "ControllingForPropMemory")

FigsLinear <- MAGMA.Celltyping::plot_celltype_associations(
    ctAssocs = ctAssocsLinear,
    ctd = ctd,
    fileTag = "ControllingForPropMemory")

We find that after controlling for prospective memory, there is no significant enrichment left associated with fluid intelligence.

Calculate cell type enrichments directly (using linear model)

Instead of relying on the tool MAGMA (which requires an extra installation step), we can also compute cell-type enrichment scores in an R-native re-implementation of the MAGMA enrichment analyses using the following functions:

ctd <- ewceData::ctd()
magma_GenesOut_file <- MAGMA.Celltyping::import_magma_files(
    file_types = "genes.out",
    return_dir = FALSE)

magmaAdjZ <- MAGMA.Celltyping:: adjust_zstat_in_genesOut(
  ctd = ctd,
  ctd_species = "mouse",
  magma_GenesOut_file = magma_GenesOut_file)

output <- MAGMA.Celltyping:: calculate_celltype_enrichment_limma(
  magmaAdjZ = magmaAdjZ,
  ctd = ctd,
  thresh = 0.0001,
  ctd_species = "mouse",
  annotLevel = 2)

We can then get the probability of the celltype being enriched as follows

print(sort(output))

The results should closely resemble those obtained using MAGMA

Gene set enrichments

To test whether a gene set is enriched using MAGMA the following commands can be used:

data("rbfox_binding")
 
geneset_res <- MAGMA.Celltyping::calculate_geneset_enrichment(
  geneset = rbfox_binding,
  gwas_sumstats_path = path_formatted_intelligence,
  analysis_name = "Rbfox_20016",
  geneset_species = "mouse")
print(geneset_res)

We can then test whether the gene set is still enriched after controlling for celltype enrichment:

ctd <- MAGMA.Celltyping::get_ctd("ctd_allKI")

analysis_name <- "Rbfox_16_pyrSS"
controlledCTs <- c("pyramidal SS")

cond_geneset_res_pyrSS = MAGMA.Celltyping::calculate_conditional_geneset_enrichment(
  geneset = geneset,
  ctd = ctd,
  controlledAnnotLevel = 1,
  controlledCTs = controlledCTs,
  gwas_sumstats_path = gwas_sumstats_path,
  analysis_name = analysis_name, 
  ctd_species = "mouse")

controlledCTs <- c("pyramidal CA1")
cond_geneset_res_pyrCA1 <- MAGMA.Celltyping::calculate_conditional_geneset_enrichment(
  geneset = geneset,
  ctd = ctd,
  controlledAnnotLevel = 1,
  controlledCTs = controlledCTs,
  gwas_sumstats_path = gwas_sumstats_path,
  analysis_name = analysis_name, 
  ctd_species = "mouse")

controlledCTs <- c("pyramidal CA1","pyramidal SS")
cond_geneset_res_pyr = MAGMA.Celltyping::calculate_conditional_geneset_enrichment(
  geneset = geneset,
  ctd = ctd,
  controlledAnnotLevel = 1,
  controlledCTs = controlledCTs,
  gwas_sumstats_path = gwas_sumstats_path,
  analysis_name = analysis_name, 
  ctd_species = "mouse")

controlledCTs <- c("Medium Spiny Neuron")
cond_geneset_res_MSN = MAGMA.Celltyping::calculate_conditional_geneset_enrichment(
  geneset = geneset,
  ctd = ctd,
  controlledAnnotLevel = 1,
  controlledCTs = controlledCTs,
  gwas_sumstats_path = gwas_sumstats_path,
  analysis_name = analysis_name, 
  ctd_species = "mouse")

controlledCTs <- c("Medium Spiny Neuron","pyramidal CA1","pyramidal SS","interneurons")
cond_geneset_res = MAGMA.Celltyping::calculate_conditional_geneset_enrichment(
  geneset = geneset,
  ctd = ctd,
  controlledAnnotLevel = 1,
  controlledCTs = controlledCTs,
  gwas_sumstats_path = gwas_sumstats_path,
  analysis_name = analysis_name, 
  ctd_species = "mouse")

Session Info

utils::sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.2             MAGMA.Celltyping_2.0.11 BiocStyle_2.29.1       
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.1                 later_1.3.1                  
##   [3] BiocIO_1.11.0                 bitops_1.0-7                 
##   [5] ggplotify_0.1.2               filelock_1.0.2               
##   [7] tibble_3.2.1                  R.oo_1.25.0                  
##   [9] XML_3.99-0.14                 lifecycle_1.0.3              
##  [11] rstatix_0.7.2                 rprojroot_2.0.3              
##  [13] MASS_7.3-60                   lattice_0.21-8               
##  [15] backports_1.4.1               magrittr_2.0.3               
##  [17] limma_3.57.7                  plotly_4.10.2                
##  [19] sass_0.4.7                    rmarkdown_2.24               
##  [21] jquerylib_0.1.4               yaml_2.3.7                   
##  [23] httpuv_1.6.11                 HGNChelper_0.8.1             
##  [25] minqa_1.2.5                   DBI_1.1.3                    
##  [27] abind_1.4-5                   zlibbioc_1.47.0              
##  [29] GenomicRanges_1.53.1          purrr_1.0.2                  
##  [31] R.utils_2.12.2                BiocGenerics_0.47.0          
##  [33] RCurl_1.98-1.12               yulab.utils_0.0.7            
##  [35] VariantAnnotation_1.47.1      rappdirs_0.3.3               
##  [37] GenomeInfoDbData_1.2.10       IRanges_2.35.2               
##  [39] S4Vectors_0.39.1              tidytree_0.4.5               
##  [41] pkgdown_2.0.7                 codetools_0.2-19             
##  [43] DelayedArray_0.27.10          xml2_1.3.5                   
##  [45] tidyselect_1.2.0              aplot_0.2.0                  
##  [47] lme4_1.1-34                   matrixStats_1.0.0            
##  [49] stats4_4.3.1                  BiocFileCache_2.9.1          
##  [51] GenomicAlignments_1.37.0      jsonlite_1.8.7               
##  [53] ellipsis_0.3.2                systemfonts_1.0.4            
##  [55] tools_4.3.1                   progress_1.2.2               
##  [57] treeio_1.25.3                 ragg_1.2.5                   
##  [59] Rcpp_1.0.11                   glue_1.6.2                   
##  [61] SparseArray_1.1.11            xfun_0.40                    
##  [63] MatrixGenerics_1.13.1         GenomeInfoDb_1.37.2          
##  [65] RNOmni_1.0.1                  BiocManager_1.30.22          
##  [67] fastmap_1.1.1                 boot_1.3-28.1                
##  [69] fansi_1.0.4                   digest_0.6.33                
##  [71] R6_2.5.1                      mime_0.12                    
##  [73] gridGraphics_0.5-1            textshaping_0.3.6            
##  [75] colorspace_2.1-0              biomaRt_2.57.1               
##  [77] RSQLite_2.3.1                 R.methodsS3_1.8.2            
##  [79] utf8_1.2.3                    tidyr_1.3.0                  
##  [81] generics_0.1.3                data.table_1.14.8            
##  [83] rtracklayer_1.61.1            prettyunits_1.1.1            
##  [85] httr_1.4.7                    htmlwidgets_1.6.2            
##  [87] S4Arrays_1.1.5                pkgconfig_2.0.3              
##  [89] gtable_0.3.3                  blob_1.2.4                   
##  [91] SingleCellExperiment_1.23.0   XVector_0.41.1               
##  [93] htmltools_0.5.6               carData_3.0-5                
##  [95] bookdown_0.35                 scales_1.2.1                 
##  [97] Biobase_2.61.0                png_0.1-8                    
##  [99] ggdendro_0.1.23               ggfun_0.1.2                  
## [101] knitr_1.43                    reshape2_1.4.4               
## [103] rjson_0.2.21                  nloptr_2.0.3                 
## [105] nlme_3.1-163                  curl_5.0.2                   
## [107] cachem_1.0.8                  stringr_1.5.0                
## [109] BiocVersion_3.18.0            parallel_4.3.1               
## [111] AnnotationDbi_1.63.2          restfulr_0.0.15              
## [113] desc_1.4.2                    pillar_1.9.0                 
## [115] grid_4.3.1                    vctrs_0.6.3                  
## [117] promises_1.2.1                ggpubr_0.6.0                 
## [119] car_3.1-2                     dbplyr_2.3.3                 
## [121] xtable_1.8-4                  evaluate_0.21                
## [123] orthogene_1.7.0               GenomicFeatures_1.53.1       
## [125] cli_3.6.1                     compiler_4.3.1               
## [127] Rsamtools_2.17.0              rlang_1.1.1                  
## [129] crayon_1.5.2                  grr_0.9.5                    
## [131] ggsignif_0.6.4                gprofiler2_0.2.2             
## [133] EWCE_1.9.2                    plyr_1.8.8                   
## [135] fs_1.6.3                      stringi_1.7.12               
## [137] viridisLite_0.4.2             ewceData_1.9.0               
## [139] BiocParallel_1.35.4           assertthat_0.2.1             
## [141] babelgene_22.9                munsell_0.5.0                
## [143] Biostrings_2.69.2             gh_1.4.0                     
## [145] lazyeval_0.2.2                homologene_1.4.68.19.3.27    
## [147] Matrix_1.6-1                  ExperimentHub_2.9.1          
## [149] MungeSumstats_1.9.15          BSgenome_1.69.0              
## [151] hms_1.1.3                     patchwork_1.1.3              
## [153] bit64_4.0.5                   ggplot2_3.4.3                
## [155] KEGGREST_1.41.0               statmod_1.5.0                
## [157] shiny_1.7.5                   SummarizedExperiment_1.31.1  
## [159] interactiveDisplayBase_1.39.0 AnnotationHub_3.9.1          
## [161] googleAuthR_2.0.1             gargle_1.5.2                 
## [163] broom_1.0.5                   memoise_2.0.1                
## [165] bslib_0.5.1                   ggtree_3.9.1                 
## [167] bit_4.0.5                     ape_5.7-1