Orphanet provides epidemiological prevalence data on ORPHA concepts (which includes diseases and phenotypes). A fter delving into this data, there’s some key pieces to note.
Detailed descriptions of each column in the data can be found here.
Orphanet only provides their data in XML format, which is useful for making targeting queries but obnoxiously cumbersome for analyzing the data all at once. Stubbornly, they don’t provide any alternative formats and tell user to “convert it yourself”. Really not a fan of their approach to this as it limits the usefulness of the data for many people. After quite a bit of troubleshooting, I did this using excel + postprocessing in R.
# New function that imports Excel-converted dataset with some additional postprocessing in R
d <- HPOExplorer:::get_orphanet_epidemiology(agg_by=NULL)
## Loading required namespace: piggyback
## ℹ All local files already up-to-date!
Prevalences are provided with different denominators (“1 / 10,000”, “1 / 1,000”, etc.). To permit comparison, I’ve parsed the numerators/denominators and then rescaled all prevalences to x cases in 100 people.
data.table::as.data.table(
sort(table(d$Prevalence.PrevalenceClass.Name),decreasing = TRUE)/nrow(d)*100
) |> `names<-`(c("PrevalenceClass","Percent"))
## PrevalenceClass Percent
## 1: <1 / 1 000 000 28.03229440
## 2: 20.02753786
## 3: 1-9 / 100 000 17.37388910
## 4: 1-9 / 1 000 000 13.73138065
## 5: Unknown 12.16047065
## 6: 1-5 / 10 000 7.36637877
## 7: 6-9 / 10 000 0.69470522
## 8: >1 / 1000 0.59456753
## 9: Not yet documented 0.01877582
p <- sum(is.na(d$prevalence))/nrow(d)*100
Within this dataset, prevalence is “NA” or “unknown” for a substantial portion of the dataset (32.2`%).
p <- length(unique(d[!is.na(prevalence)]$id))
In the end, we have non-missing prevalence data for 4716 unique Orphanet IDs.
Usefully, they also provide metadata on the population in which the data was measured (e.g. China, US, worldwide).
MultiEWCE::create_dt(
d[,list(n=.N),by=`Prevalence.PrevalenceGeographic.Name`][order(-n)]
)
## Registered S3 method overwritten by 'ggtree':
## method from
## fortify.igraph ggnetwork
## Loading required namespace: DT
ww <- d[`Prevalence.PrevalenceGeographic.Name`=="Worldwide",]
p <- length(unique(ww$id)) / length(unique(d$id))*100
This selection of population affects the interpretation, and 82.9% of all concepts have a “worldwide” prevalence count.
For now, I’m just computing the mean prevalence per concept across all populations in which that concept’s prevalence was measured.
Prevalence is a bit more of a complicated idea than you might initially think, and can fall into different classes depending on how you study it. See this publication for a more in depth discussion.
See counts of each type below.
data.table::data.table(rev(sort(table(d$Prevalence.PrevalenceType.Name))))
## V1 N
## 1: Point prevalence 7622
## 2: Cases/families 3197
## 3: Annual incidence 3046
## 4: Prevalence at birth 2066
## 5: Lifetime Prevalence 47
A non-trivial aspect of using any of these non-HPO-based datasets is mapping IDs from one ontology to another. Currently, there is no easy way to do this (while avoiding missing data).
The best strategy I’ve come up with for Orphanet is to first map all
Orphanet IDs to MONDO IDs with HPOExplorer:::mondo_dict
.
This function uses cross-ontology referencing from the MONDO ontology
object. Then I do the same for the HPO disease IDs (which are a mix of
Orphanet, OMIM, and DECIPHER IDs). This allows me to have a common ID
framework (MONDO) to match up the two different datasets.
### Get all disease and phenotype data from the HPO
phenos <- HPOExplorer::make_phenos_dataframe(add_disease_data = TRUE)
## Reading cached RDS file: phenotype_to_genes.txt
## + Version: v2023-10-09
## Extracting data for 10,969 descendents.
## Computing gene counts.
## Getting absolute ontology level for 10,969 HPO IDs.
## ℹ All local files already up-to-date!
## Computing ontology level / gene count ratio.
## Adding term definitions.
## ℹ All local files already up-to-date!
## Adding level-3 ancestor to each HPO ID.
## ℹ All local files already up-to-date!
## Annotating phenos with n_diseases
## Reading cached RDS file: phenotype_to_genes.txt
## + Version: v2023-10-09
## Reading cached RDS file: genes_to_phenotype.txt
## + Version: v2023-10-09
## Reading cached RDS file: phenotype.hpoa
## + Version: v2023-10-09
## Annotating phenos with onset.
## Annotating phenos with Disease
## Reading cached RDS file: phenotype.hpoa
## + Version: v2023-10-09
## Annotating phenos with AgeOfDeath.
## Annotating phenos with Tiers.
## Annotating phenos with modifiers
## Annotating phenotype frequencies.
## Adding disease metadata: Definitions, Preferred.Label
## Importing Orphanet metadata.
## Importing OMIM metadata.
## 1 / 12,527 (0.01%) disease_name missing.
## 8,228 / 12,468 (65.99%) Definitions missing.
Using this strategy, I’m able to map: - 100% of Orphanet IDs - 99.7% of disease IDs in the HPO data - 11.7% of phenotype IDs in the HPO data
A decision point we’ll need to consider is whether we’d ultimately like to filter prioritised targets based on: 1. disease (disease_id) frequency 2. phenotype (hpo_id) frequency
Both “disease_id” and “hpo_id” can be mapped onto MONDO IDs, and thus merged with the Orphanet frequency data. Depending on how we do this merge, we get quite different results.
phenos2 <- HPOExplorer::add_prevalence(phenos = phenos,
id_col = "hpo_id")
## Annotating phenos with MONDO metadata.
## ℹ All local files already up-to-date!
## 10,469 / 10,969 (95.44%) MONDO_ID missing.
## 10,966 / 10,969 (99.97%) MONDO_name missing.
## 10,968 / 10,969 (99.99%) MONDO_definition missing.
## ℹ All local files already up-to-date!
## Prevalence added for 3,051 / 12,468 disease_id IDs (24.5%)
## Prevalence added for 63 / 10,969 hpo_id IDs (0.6%)
## Prevalence added for 63 / 501 MONDO_ID IDs (12.6%)
phenos3 <- HPOExplorer::add_prevalence(phenos = phenos,
id_col = "disease_id")
## Annotating phenos with MONDO metadata.
## ℹ All local files already up-to-date!
## 34 / 12,468 (0.27%) MONDO_ID missing.
## 11,573 / 12,468 (92.82%) MONDO_name missing.
## 12,396 / 12,468 (99.42%) MONDO_definition missing.
## ℹ All local files already up-to-date!
## Prevalence added for 33 / 12,468 disease_id IDs (0.3%)
## Prevalence added for 540 / 10,969 hpo_id IDs (4.9%)
## Prevalence added for 33 / 12,435 MONDO_ID IDs (0.3%)
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] later_1.3.1 bitops_1.0-7
## [3] ggplotify_0.1.2 GeneOverlap_1.38.0
## [5] filelock_1.0.2 tibble_3.2.1
## [7] R.oo_1.25.0 ontologyPlot_1.6
## [9] ggnetwork_0.5.12 graph_1.80.0
## [11] lifecycle_1.0.4 httr2_0.2.3
## [13] rstatix_0.7.2 HPOExplorer_0.99.12
## [15] lattice_0.22-5 crosstalk_1.2.0
## [17] backports_1.4.1 magrittr_2.0.3
## [19] limma_3.58.1 plotly_4.10.3
## [21] sass_0.4.7 rmarkdown_2.25
## [23] jquerylib_0.1.4 yaml_2.3.7
## [25] httpuv_1.6.12 HGNChelper_0.8.1
## [27] DBI_1.1.3 RColorBrewer_1.1-3
## [29] lubridate_1.9.3 abind_1.4-5
## [31] zlibbioc_1.48.0 GenomicRanges_1.54.1
## [33] purrr_1.0.2 R.utils_2.12.2
## [35] BiocGenerics_0.48.1 RCurl_1.98-1.13
## [37] yulab.utils_0.1.0 rappdirs_0.3.3
## [39] MultiEWCE_0.1.9 GenomeInfoDbData_1.2.11
## [41] IRanges_2.36.0 S4Vectors_0.40.1
## [43] gitcreds_0.1.2 tidytree_0.4.5
## [45] piggyback_0.1.5 DelayedArray_0.28.0
## [47] DT_0.30 tidyselect_1.2.0
## [49] aplot_0.2.2 matrixStats_1.1.0
## [51] stats4_4.3.1 BiocFileCache_2.10.1
## [53] jsonlite_1.8.7 ellipsis_0.3.2
## [55] paintmap_1.0 tools_4.3.1
## [57] treeio_1.26.0 Rcpp_1.0.11
## [59] glue_1.6.2 SparseArray_1.2.2
## [61] xfun_0.41 MatrixGenerics_1.14.0
## [63] GenomeInfoDb_1.38.0 RNOmni_1.0.1.2
## [65] dplyr_1.1.3 BiocManager_1.30.22
## [67] fastmap_1.1.1 fansi_1.0.5
## [69] caTools_1.18.2 digest_0.6.33
## [71] timechange_0.2.0 R6_2.5.1
## [73] mime_0.12 gridGraphics_0.5-1
## [75] colorspace_2.1-0 gtools_3.9.4
## [77] RSQLite_2.3.3 R.methodsS3_1.8.2
## [79] utf8_1.2.4 tidyr_1.3.0
## [81] generics_0.1.3 data.table_1.14.8
## [83] httr_1.4.7 htmlwidgets_1.6.2
## [85] S4Arrays_1.2.0 ontologyIndex_2.11
## [87] pkgconfig_2.0.3 gtable_0.3.4
## [89] blob_1.2.4 SingleCellExperiment_1.24.0
## [91] XVector_0.42.0 htmltools_0.5.7
## [93] carData_3.0-5 scales_1.2.1
## [95] Biobase_2.62.0 png_0.1-8
## [97] ggfun_0.1.3 knitr_1.45
## [99] rstudioapi_0.15.0 reshape2_1.4.4
## [101] coda_0.19-4 statnet.common_4.9.0
## [103] nlme_3.1-163 curl_5.1.0
## [105] cachem_1.0.8 stringr_1.5.0
## [107] BiocVersion_3.18.0 KernSmooth_2.23-22
## [109] parallel_4.3.1 AnnotationDbi_1.64.1
## [111] pillar_1.9.0 grid_4.3.1
## [113] vctrs_0.6.4 gplots_3.1.3
## [115] promises_1.2.1 ggpubr_0.6.0
## [117] car_3.1-2 dbplyr_2.4.0
## [119] xtable_1.8-4 Rgraphviz_2.46.0
## [121] evaluate_0.23 orthogene_1.8.0
## [123] cli_3.6.1 compiler_4.3.1
## [125] rlang_1.1.2 crayon_1.5.2
## [127] grr_0.9.5 ggsignif_0.6.4
## [129] gprofiler2_0.2.2 EWCE_1.11.2
## [131] plyr_1.8.9 fs_1.6.3
## [133] stringi_1.7.12 viridisLite_0.4.2
## [135] ewceData_1.10.0 network_1.18.1
## [137] babelgene_22.9 munsell_0.5.0
## [139] Biostrings_2.70.1 lazyeval_0.2.2
## [141] gh_1.4.0 homologene_1.4.68.19.3.27
## [143] Matrix_1.6-1.1 ExperimentHub_2.10.0
## [145] patchwork_1.1.3 bit64_4.0.5
## [147] ggplot2_3.4.4 KEGGREST_1.42.0
## [149] statmod_1.5.0 shiny_1.7.5.1
## [151] SummarizedExperiment_1.32.0 interactiveDisplayBase_1.40.0
## [153] AnnotationHub_3.10.0 broom_1.0.5
## [155] memoise_2.0.1.9000 bslib_0.5.1
## [157] ggtree_3.10.0 bit_4.0.5
## [159] ape_5.7-1