Follows conversation on GitHub here.
You can think of each phenotype-disease combination as its own “symptom”, each with semi-overlapping gene sets.
If the disease where the gene lists came from does not matter, the mean correlations within groups of symptoms (belonging to a particular phenotype) should be high. If they are low, it means the symptoms are very different from one another (despite all being associated with the same phenotype).
results <- MultiEWCE::load_example_results()
## Registered S3 method overwritten by 'ggtree':
## method from
## fortify.igraph ggnetwork
message("N phenotypes: ",formatC(length(unique(results$HPO_ID)),big.mark = ","))
## N phenotypes: 2,831
annot <- HPOExplorer::load_phenotype_to_genes(1)
## Importing existing file: ... phenotype_to_genes.txt
annot[,HPO_ID.DatabaseID:=paste(HPO_ID,LinkID,sep=".")]
message("N symptoms: ",formatC(length(unique(annot$HPO_ID.DatabaseID)),big.mark = ","))
## N symptoms: 621,906
Get only symptoms with >4 genes each.
gene_counts <- annot[,list(n=length(unique(Gene))),
by="HPO_ID.DatabaseID"]
gene_counts <- cbind(gene_counts,
data.table::data.table(stringr::str_split(gene_counts$HPO_ID.DatabaseID,"\\.",
simplify = TRUE)) |> `names<-`(c("HPO_ID","DatabaseID")))
gene_counts_valid <- gene_counts[n>=4]
annot <- annot[HPO_ID.DatabaseID %in% gene_counts_valid$HPO_ID.DatabaseID,]
message(round(nrow(gene_counts_valid)/nrow(gene_counts)*100,2),"%",
" of HPO_IDs remain after filtering by symptom gene lists.")
## 5.39% of HPO_IDs remain after filtering by symptom gene lists.
X <- HPOExplorer::hpo_to_matrix(phenotype_to_genes = annot,
formula = "Gene ~ HPO_ID.DatabaseID")
## Constructing HPO gene x phenotype matrix.
For each phenotype, compute the mean correlation between each of its symptoms.
#### Parse names ####
name_map <- cbind(HPO_ID.DatabaseID=colnames(X),
data.table::data.table(stringr::str_split(colnames(X),"\\.",
simplify = TRUE)) |> `names<-`(c("HPO_ID","DatabaseID"))) |>
data.table::setkeyv(cols = "HPO_ID.DatabaseID")
#### Aggregate corr/phenotype group ####
group_cor <- lapply(stats::setNames(unique(name_map$HPO_ID),
unique(name_map$HPO_ID)),
function(id){
idx <- name_map[HPO_ID==id]$HPO_ID.DatabaseID
X_sub <- X[,idx, drop=FALSE]
#### Get number of overlapping genes #####
rs <- Matrix::rowSums(X_sub)
X_sub <- X_sub[rs>0,]
intersect_size <- sum(rs==ncol(X_sub))
union_size <- sum(rs>0)
max_overlap <- max(rs, na.rm = TRUE)
#### Compute corr ####
X_cor <- WGCNA::cor(x = X_sub)
diag(X_cor) <- NA
rm <- Matrix::rowMeans(X_cor, na.rm = TRUE)
data.table::data.table(
intersect_size=intersect_size,
union_size=union_size,
max_overlap=max_overlap,
cor_mean=mean(rm, na.rm=TRUE),
cor_sd=sd(rm, na.rm = TRUE)
)
}) |> data.table::rbindlist(use.names = TRUE, idcol = "HPO_ID")
##
hist(group_cor$cor_mean, 50)
I found that the symptom gene lists are not very strongly correlated with one another. Since the presence/absence of a gene is binary, correlation basically equates to “proportion of genes that overlap”.
That said, even given the low correlation between symptoms within a given phenotype, there are some drawbacks to this symptom-focused approach:
So theoretical debates aside, it seems that the benefit of aggregating gene lists to the level of phenotypes is that it increases our coverage of testable HPO terms (at least with EWCE) and decreases our multiple testing burden.
summary(group_cor$cor_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.0000 -0.4964 -0.1983 -0.3421 -0.0500 0.6703 1460
Which diseases contributed the genes to a particular celltype-phenotype association?
To answer this, I computed a simply gene overlap-based enrichment test for all symptoms, defined as HPO_ID + DiseaseID. This meant: 747,584 symptoms x 77 celltypes = 57,563,968 tests.
gene_data <- HPOExplorer::load_phenotype_to_genes()
gene_data[,HPO_ID.LinkID:=paste(HPO_ID,LinkID,sep=".")]
symptoms <- gen_overlap(gene_data = gene_data,
list_name_column = "HPO_ID.LinkID",
cores=30)
g <- HPOExplorer::load_phenotype_to_genes()
## Importing existing file: ... phenotype_to_genes.txt
res <- MultiEWCE::load_example_results("Descartes_All_Results_extras.rds")
res <- res[q<0.05]
symp <- MultiEWCE::load_example_results("gen_overlap.symptoms.rds")
report <- function(symp,
annot,
symp_var,
annot_var=symp_var){
message(paste(formatC(nrow(symp),big.mark = ","),
"results remain."))
n1 <- length(unique(symp[[symp_var]]))
n2 <- length(unique(annot[[annot_var]]))
message(paste(n1,"/",n2,symp_var,
paste0("(",round((n1/n2)*100,2),"%)"),
"remain."))
}
First, we remove all tests that had 0 overlap, since we can safely rule those out. This still leaves us with >10M results.
symp <- symp[intersection_size>0]
#### Split IDs apart again ####
symp[,c("HPO_ID","LinkID"):=data.table::tstrsplit(HPO_ID.LinkID,"\\.")]
#### Sort by pval then odds.ratio ####
data.table::setorderv(symp,c("pval","odds.ratio"), c(1,-1))
report(symp, g, symp_var="HPO_ID")
## 10,080,094 results remain.
## 10598 / 9677 HPO_ID (109.52%) remain.
report(symp, g, symp_var="LinkID")
## 10,080,094 results remain.
## 8119 / 7839 LinkID (103.57%) remain.
Next, we try filtering by nominal p-values, leaving 684,697 results.
psig <- symp[pval<0.05,]
report(psig, g, symp_var="HPO_ID")
## 684,697 results remain.
## 8315 / 9677 HPO_ID (85.93%) remain.
report(psig, g, symp_var="LinkID")
## 684,697 results remain.
## 3002 / 7839 LinkID (38.3%) remain.
Using multiple-testing correction is too stringent due to the huge number of tests, and only leaves 14 unique diseases (<1%).
qsig <- symp[qval<0.05,]
report(qsig, g, symp_var="HPO_ID")
## 8,312 results remain.
## 864 / 9677 HPO_ID (8.93%) remain.
report(qsig, g, symp_var="LinkID")
## 8,312 results remain.
## 14 / 7839 LinkID (0.18%) remain.
In either filtering scenario, we are left with less than a complete picture of the celltypes invovled in each symptom. So instead, let’s just take all of the ones with at least 1 overlapping genes. We can always filter again later.
Now we can merge back with our celltype-phenotype enrichment results. This gives us a 99.3% annotation rate for linking phenotypes back to diseases via celltype-specific mechanisms. It also gives us a new column “intersection” which lists the genes that are mediating that interaction.
res$CellType <- EWCE::fix_celltype_names(res$CellType, make_unique = FALSE)
res2 <- data.table::merge.data.table(res,
symp,
by=c("HPO_ID","CellType"))
report(res2, res, symp_var="HPO_ID")
## 1,106,181 results remain.
## 2812 / 2831 HPO_ID (99.33%) remain.
report(res2, g, symp_var="HPO_ID")
## 1,106,181 results remain.
## 2812 / 9677 HPO_ID (29.06%) remain.
report(res2, g, symp_var="LinkID")
## 1,106,181 results remain.
## 8024 / 7839 LinkID (102.36%) remain.
We’ll upload the subset of symptoms that were kept in the end to a new file for easy usage later.
symp_filt <- symp[HPO_ID.LinkID %in% unique(res2$HPO_ID.LinkID)]
tmp <- file.path(tempdir(),"gen_overlap.symptoms.filt.rds")
saveRDS(symp_filt, tmp)
piggyback::pb_upload(file = tmp,
repo = "neurogenomics/MultiEWCE",
tag = "v0.0.1")
cols <- c("pval","odds.ratio","jaccard","qval")
data.table::setnames(symp,cols, paste0("symptoms.",gsub("\\.","_",cols)))
res2 <- data.table::merge.data.table(res,
symp,
all.x = TRUE,
by=c("HPO_ID","CellType"))
tmp <- file.path(tempdir(),"Descartes_All_Results_extras.symptoms.rds")
saveRDS(res2, tmp)
piggyback::pb_upload(file = tmp,
repo = "neurogenomics/MultiEWCE",
tag = "v0.0.1")
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23)
## os macOS Big Sur ... 10.16
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/London
## date 2023-03-22
## pandoc 3.1 @ /usr/local/bin/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
## AnnotationDbi 1.60.2 2023-03-10 [1] Bioconductor
## AnnotationHub 3.6.0 2022-11-01 [1] Bioconductor
## ape 5.7-1 2023-03-13 [1] CRAN (R 4.2.0)
## aplot 0.1.10 2023-03-08 [1] CRAN (R 4.2.0)
## babelgene 22.9 2022-09-29 [1] CRAN (R 4.2.0)
## backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
## base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.2.0)
## Biobase 2.58.0 2022-11-01 [1] Bioconductor
## BiocFileCache 2.6.1 2023-02-17 [1] Bioconductor
## BiocGenerics 0.44.0 2022-11-01 [1] Bioconductor
## BiocManager 1.30.20 2023-02-24 [1] CRAN (R 4.2.0)
## BiocVersion 3.16.0 2022-05-05 [1] Bioconductor
## Biostrings 2.66.0 2022-11-01 [1] Bioconductor
## bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.0)
## bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.0)
## bitops 1.0-7 2021-04-24 [1] CRAN (R 4.2.0)
## blob 1.2.3 2022-04-10 [1] CRAN (R 4.2.0)
## broom 1.0.4 2023-03-11 [1] CRAN (R 4.2.1)
## bslib 0.4.2 2022-12-16 [1] CRAN (R 4.2.1)
## cachem 1.0.7 2023-02-24 [1] CRAN (R 4.2.0)
## car 3.1-1 2022-10-19 [1] CRAN (R 4.2.1)
## carData 3.0-5 2022-01-06 [1] CRAN (R 4.2.0)
## caTools 1.18.2 2021-03-28 [1] CRAN (R 4.2.0)
## checkmate 2.1.0 2022-04-21 [1] CRAN (R 4.2.0)
## cli 3.6.0 2023-01-09 [1] CRAN (R 4.2.0)
## cluster 2.1.4 2022-08-22 [1] CRAN (R 4.2.0)
## coda 0.19-4 2020-09-30 [1] CRAN (R 4.2.0)
## codetools 0.2-19 2023-02-01 [1] CRAN (R 4.2.0)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.1)
## crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0)
## curl 5.0.0 2023-01-12 [1] CRAN (R 4.2.0)
## data.table 1.14.8 2023-02-17 [1] CRAN (R 4.2.0)
## DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
## dbplyr 2.3.1 2023-02-24 [1] CRAN (R 4.2.0)
## DelayedArray 0.24.0 2022-11-01 [1] Bioconductor
## digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.0)
## doParallel 1.0.17 2022-02-07 [1] CRAN (R 4.2.0)
## dplyr 1.1.0 2023-01-29 [1] CRAN (R 4.2.1)
## dynamicTreeCut 1.63-1 2016-03-11 [1] CRAN (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
## evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.1)
## EWCE 1.7.2 2023-03-10 [1] Bioconductor
## ewceData 1.7.1 2023-02-21 [1] Github (neurogenomics/ewceData@7303707)
## ExperimentHub 2.6.0 2022-11-01 [1] Bioconductor
## fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.1)
## fastcluster 1.2.3 2021-05-24 [1] CRAN (R 4.2.0)
## fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.0)
## filelock 1.0.2 2018-10-05 [1] CRAN (R 4.2.0)
## foreach 1.5.2 2022-02-02 [1] CRAN (R 4.2.0)
## foreign 0.8-84 2022-12-06 [1] CRAN (R 4.2.0)
## Formula 1.2-5 2023-02-24 [1] CRAN (R 4.2.0)
## GeneOverlap 1.34.0 2022-11-01 [1] Bioconductor
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
## GenomeInfoDb 1.34.9 2023-02-02 [1] Bioconductor
## GenomeInfoDbData 1.2.9 2022-10-02 [1] Bioconductor
## GenomicRanges 1.50.2 2022-12-16 [1] Bioconductor
## ggfun 0.0.9 2022-11-21 [1] CRAN (R 4.2.0)
## ggnetwork 0.5.12 2023-03-06 [1] CRAN (R 4.2.1)
## ggplot2 3.4.1 2023-02-10 [1] CRAN (R 4.2.0)
## ggplotify 0.1.0 2021-09-02 [1] CRAN (R 4.2.0)
## ggpubr 0.6.0 2023-02-10 [1] CRAN (R 4.2.0)
## ggsignif 0.6.4 2022-10-13 [1] CRAN (R 4.2.0)
## ggtree 3.6.2 2022-11-10 [1] Bioconductor
## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
## GO.db 3.16.0 2022-10-02 [1] Bioconductor
## gplots 3.1.3 2022-04-25 [1] CRAN (R 4.2.0)
## gprofiler2 0.2.1 2021-08-23 [1] CRAN (R 4.2.0)
## graph 1.76.0 2022-11-01 [1] Bioconductor
## gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.0)
## gridGraphics 0.5-1 2020-12-13 [1] CRAN (R 4.2.0)
## grr 0.9.5 2016-08-26 [1] CRAN (R 4.2.0)
## gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0)
## gtools 3.9.4 2022-11-27 [1] CRAN (R 4.2.0)
## HGNChelper 0.8.1 2019-10-24 [1] CRAN (R 4.2.0)
## highr 0.10 2022-12-22 [1] CRAN (R 4.2.1)
## Hmisc 5.0-1 2023-03-08 [1] CRAN (R 4.2.0)
## homologene 1.4.68.19.3.27 2019-03-28 [1] CRAN (R 4.2.0)
## HPOExplorer 0.99.7 2023-03-22 [1] Bioconductor
## htmlTable 2.4.1 2022-07-07 [1] CRAN (R 4.2.0)
## htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.0)
## htmlwidgets 1.6.1 2023-01-07 [1] CRAN (R 4.2.0)
## httpuv 1.6.9 2023-02-14 [1] CRAN (R 4.2.1)
## httr 1.4.5 2023-02-24 [1] CRAN (R 4.2.0)
## impute 1.72.3 2023-01-19 [1] Bioconductor
## interactiveDisplayBase 1.36.0 2022-11-01 [1] Bioconductor
## IRanges 2.32.0 2022-11-01 [1] Bioconductor
## iterators 1.0.14 2022-02-05 [1] CRAN (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.0)
## jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.0)
## KEGGREST 1.38.0 2022-11-01 [1] Bioconductor
## KernSmooth 2.23-20 2021-05-03 [1] CRAN (R 4.2.1)
## knitr 1.42 2023-01-25 [1] CRAN (R 4.2.1)
## later 1.3.0 2021-08-18 [1] CRAN (R 4.2.0)
## lattice 0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.2.0)
## lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
## limma 3.54.2 2023-02-28 [1] Bioconductor
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
## Matrix 1.5-3 2022-11-11 [1] CRAN (R 4.2.1)
## MatrixGenerics 1.10.0 2022-11-01 [1] Bioconductor
## matrixStats 0.63.0 2022-11-18 [1] CRAN (R 4.2.0)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.0)
## mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
## minidown 0.4.0 2022-02-08 [1] CRAN (R 4.2.0)
## MultiEWCE 0.1.4 2023-03-22 [1] Bioconductor
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
## network 1.18.1 2023-01-24 [1] CRAN (R 4.2.1)
## nlme 3.1-162 2023-01-31 [1] CRAN (R 4.2.0)
## nnet 7.3-18 2022-09-28 [1] CRAN (R 4.2.0)
## ontologyIndex 2.10 2022-08-24 [1] CRAN (R 4.2.0)
## ontologyPlot 1.6 2021-02-10 [1] CRAN (R 4.2.0)
## orthogene 1.4.1 2022-11-13 [1] Bioconductor
## paintmap 1.0 2016-08-31 [1] CRAN (R 4.2.0)
## patchwork 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
## piggyback 0.1.4 2022-07-19 [1] CRAN (R 4.2.0)
## pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
## plotly 4.10.1 2022-11-07 [1] CRAN (R 4.2.1)
## plyr 1.8.8 2022-11-11 [1] CRAN (R 4.2.1)
## png 0.1-8 2022-11-29 [1] CRAN (R 4.2.1)
## preprocessCore 1.60.2 2023-01-19 [1] Bioconductor
## promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
## purrr 1.0.1 2023-01-10 [1] CRAN (R 4.2.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
## rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.2.0)
## RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
## Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.2.1)
## RCurl 1.98-1.10 2023-01-27 [1] CRAN (R 4.2.0)
## reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.2.0)
## Rgraphviz 2.42.0 2022-11-01 [1] Bioconductor
## rlang 1.1.0 2023-03-14 [1] CRAN (R 4.2.0)
## rmarkdown 2.20.1 2023-02-16 [1] Github (rstudio/rmarkdown@a75dc37)
## RNOmni 1.0.1 2022-08-15 [1] CRAN (R 4.2.0)
## rpart 4.1.19 2022-10-21 [1] CRAN (R 4.2.0)
## RSQLite 2.3.0 2023-02-17 [1] CRAN (R 4.2.0)
## rstatix 0.7.2 2023-02-01 [1] CRAN (R 4.2.1)
## rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
## S4Vectors 0.36.2 2023-02-26 [1] Bioconductor
## sass 0.4.5 2023-01-24 [1] CRAN (R 4.2.0)
## scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
## shiny 1.7.4 2022-12-15 [1] CRAN (R 4.2.1)
## SingleCellExperiment 1.20.0 2022-11-01 [1] Bioconductor
## statnet.common 4.8.0 2023-01-24 [1] CRAN (R 4.2.1)
## stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.0)
## stringr 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
## SummarizedExperiment 1.28.0 2022-11-01 [1] Bioconductor
## survival 3.5-5 2023-03-12 [1] CRAN (R 4.2.0)
## tibble 3.2.0 2023-03-08 [1] CRAN (R 4.2.0)
## tidyr 1.3.0 2023-01-24 [1] CRAN (R 4.2.1)
## tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
## tidytree 0.4.2 2022-12-18 [1] CRAN (R 4.2.0)
## treeio 1.23.1 2023-03-16 [1] Github (GuangchuangYu/treeio@2f756cb)
## utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.1)
## vctrs 0.6.0 2023-03-16 [1] CRAN (R 4.2.1)
## viridisLite 0.4.1 2022-08-22 [1] CRAN (R 4.2.0)
## WGCNA 1.72-1 2023-01-18 [1] CRAN (R 4.2.0)
## xfun 0.37 2023-01-31 [1] CRAN (R 4.2.1)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.0)
## XVector 0.38.0 2022-11-01 [1] Bioconductor
## yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.1)
## yulab.utils 0.0.6 2022-12-20 [1] CRAN (R 4.2.1)
## zlibbioc 1.44.0 2022-11-01 [1] Bioconductor
##
## [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────