We have stored the GPT-generated annotations on GitHub Releases and distribute them via the gpt_annot_read function.
annot <- HPOExplorer::gpt_annot_read()
## Reading cached RDS file: phenotype_to_genes.txt
## + Version: v2023-10-09
## Reading in GPT annotations for 10,648 phenotypes.
knitr::kable(head(annot))
hpo_name | intellectual_disability | intellectual_disability_justification | death | death_justification | impaired_mobility | impaired_mobility_justification | physical_malformations | physical_malformations_justification | blindness | blindness_justification | sensory_impairments | sensory_impairments_justification | immunodeficiency | immunodeficiency_justification | cancer | cancer_justification | reduced_fertility | reduced_fertility_justification | congenital_onset | congenital_onset_justification | hpo_id | pheno_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1-2 toe complete cutaneous syndactyly | never | This condition affects the physical structure of the toes, not cognitive function. | rarely | Death could occur if complications arise during surgery. | often | The affected individual may have difficulty walking. | always | The toes are physically fused together. | never | This condition does not affect vision. | never | This condition does not affect sensory function. | never | This condition does not affect the immune system. | never | This condition does not increase the risk of cancer. | never | This condition does not affect fertility. | always | This condition is present at birth. | HP:0005767 | 1 |
1-2 toe syndactyly | never | No direct link between toe syndactyly and intellectual disability | never | Toe syndactyly does not directly cause death | rarely | May cause minor mobility issues in some cases | always | Toe syndactyly is a physical malformation | never | No direct link between toe syndactyly and blindness | never | No direct link between toe syndactyly and sensory impairments | never | No direct link between toe syndactyly and immunodeficiency | never | No direct link between toe syndactyly and cancer | never | No direct link between toe syndactyly and reduced fertility | always | Toe syndactyly is present at birth | HP:0010711 | 1 |
1-3 toe syndactyly | never | No relation to cognitive function. | never | No direct cause of death. | often | Can hinder normal walking or foot function. | always | Direct manifestation of the condition. | never | No relation to vision. | never | No sensory impairments other than the physical fusion. | never | Not related. | never | No relation to cancer. | never | No direct relation to fertility. | always | Is congenital in nature. | HP:0001459 | 1 |
1-4 finger syndactyly | never | No direct link between finger syndactyly and intellectual disability | never | Finger syndactyly does not directly cause death | often | Can cause significant mobility issues depending on severity | always | Finger syndactyly is a physical malformation | never | No direct link between finger syndactyly and blindness | never | No direct link between finger syndactyly and sensory impairments | never | No direct link between finger syndactyly and immunodeficiency | never | No direct link between finger syndactyly and cancer | never | No direct link between finger syndactyly and reduced fertility | always | Finger syndactyly is present at birth | HP:0010707 | 1 |
1-4 toe syndactyly | never | This condition affects the physical structure of the toes, not the brain. | never | This condition does not directly cause death. | often | The fusion of toes can lead to balance and walking issues. | always | Syndactyly is a physical malformation where two or more digits are fused together. | never | This condition affects the physical structure of the toes, not the eyes. | rarely | While the condition can potentially affect the nerves in the toes, it is not commonly associated with sensory impairments. | never | This condition does not affect the immune system. | never | This condition does not cause cancer. | never | This condition does not affect fertility. | always | Syndactyly is a congenital condition, meaning it is present at birth. | HP:0010712 | 1 |
1-5 finger complete cutaneous syndactyly | never | No direct correlation with intellectual capacities. | never | It is not a condition that directly induces mortality. | often | Syndactyly involves fusion of digits, often leading to mobility issues. | always | Syndactyly itself is a physical malformation. | never | There is no impact on the visual system. | rarely | While general sensory faculties are intact, fine touch discrimination may be rarely affected. | never | No impact on the immune system’s functionality. | never | No known association with cancer. | never | Fertility is not impacted by this physical malformation. | always | This is a congenital condition and is present from birth. | HP:0006088 | 1 |
Some phenotypes were annotated multiple times, though most were annotated only once:
table(annot$pheno_count)
##
## 1 2 3
## 10378 478 93
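The row counts above can be related back to the number of phenotypes reported when the annotations were read in. The following is a minimal sketch, assuming that a phenotype annotated k times contributes k rows with pheno_count equal to k (an assumption about the data layout, not something the package documents here):

```r
# Row counts per pheno_count value, copied from the table() output above
row_counts <- c(`1` = 10378, `2` = 478, `3` = 93)

# Assumption: a phenotype annotated k times contributes k rows,
# so dividing each row count by k gives the number of distinct phenotypes
n_phenotypes <- row_counts / as.numeric(names(row_counts))

total_rows <- sum(row_counts)          # all annotation rows
total_phenotypes <- sum(n_phenotypes)  # distinct phenotypes

total_rows
total_phenotypes
```

Under that assumption, the distinct-phenotype total agrees with the 10,648 phenotypes reported by gpt_annot_read.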
We can identify true positives by selecting phenotypes that fall within specific branches of the HPO, for which a response of at least “often” is expected. For example, blindness phenotypes should often be associated with blindness.
You can modify the searches with the search_hpo function.
query_hits <- HPOExplorer::search_hpo()
## Loading required namespace: piggyback
## ℹ All local files already up-to-date!
## Number of phenotype gits per query group:
## - intellectual_disability: 7
## - impaired_mobility: 294
## - physical_malformations: 105
## - blindness: 2
## - sensory_impairments: 256
## - immunodeficiency: 10
## - cancer: 697
## - reduced_fertility: 6
lapply(query_hits,head)
## $intellectual_disability
## [1] "HP:0001249" "HP:0001256" "HP:0002187" "HP:0002342" "HP:0006887"
## [6] "HP:0006889"
##
## $impaired_mobility
## [1] "HP:0011442" "HP:0100022" "HP:0000473" "HP:0001336" "HP:0001337"
## [6] "HP:0002071"
##
## $physical_malformations
## [1] "HP:0001305" "HP:0002308" "HP:0002390" "HP:0002408" "HP:0002438"
## [6] "HP:0002564"
##
## $blindness
## [1] "HP:0000618" "HP:0007875"
##
## $sensory_impairments
## [1] "HP:0000223" "HP:0000364" "HP:0000504" "HP:0003474" "HP:0004408"
## [6] "HP:0000224"
##
## $immunodeficiency
## [1] "HP:0001400" "HP:0002721" "HP:0002755" "HP:0003553" "HP:0004430"
## [6] "HP:0005352"
##
## $cancer
## [1] "HP:0002664" "HP:0002894" "HP:0002896" "HP:0002898" "HP:0003003"
## [6] "HP:0004375"
##
## $reduced_fertility
## [1] "HP:0000144" "HP:0000868" "HP:0012041" "HP:0000789" "HP:0008222"
## [6] "HP:0003251"
checks <- HPOExplorer::gpt_annot_check(annot = annot,
query_hits = query_hits)
When a phenotype has more than one set of annotations, how consistent are they (on a 0-1 scale)?
sort(unlist(checks$annot_consist))
## sensory_impairments intellectual_disability cancer
## 0.7703704 0.8259259 0.8370370
## reduced_fertility death impaired_mobility
## 0.8407407 0.8703704 0.8740741
## physical_malformations immunodeficiency congenital_onset
## 0.8962963 0.9000000 0.9185185
## blindness pheno_count
## 0.9296296 1.0000000
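To make the idea of a consistency score concrete, here is a minimal sketch on a toy data frame. It assumes that a phenotype counts as consistent for a column when all of its repeated responses in that column are identical; this definition, the toy hpo_id values, and the responses are all illustrative assumptions, and gpt_annot_check may compute consistency differently.

```r
# Toy annotations (made-up values): HP:B and HP:C were annotated more than once
toy <- data.frame(
  hpo_id    = c("HP:A", "HP:B", "HP:B", "HP:C", "HP:C", "HP:C"),
  blindness = c("never", "never", "never", "rarely", "never", "never"),
  cancer    = c("often", "never", "never", "often", "often", "often"),
  stringsAsFactors = FALSE
)

# Keep only phenotypes with more than one set of annotations
n_sets <- table(toy$hpo_id)
multi <- toy[toy$hpo_id %in% names(n_sets)[n_sets > 1], ]

# Assumed definition: a phenotype is consistent for a column when
# all of its responses in that column are identical
consistency <- sapply(c("blindness", "cancer"), function(col) {
  agree <- tapply(multi[[col]], multi$hpo_id,
                  function(x) length(unique(x)) == 1)
  mean(agree)  # proportion of repeated phenotypes that agree
})

consistency
```

In this toy example only one of the two repeated phenotypes agrees on blindness, while both agree on cancer.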
What proportion of annotated phenotypes can be validated (per annotation column)?
sort(checks$checkable_rate)
## blindness immunodeficiency reduced_fertility
## 0.0001826651 0.0005479953 0.0005479953
## intellectual_disability physical_malformations sensory_impairments
## 0.0006393278 0.0054799525 0.0138825464
## impaired_mobility cancer
## 0.0231071331 0.0416476391
What is the absolute number of phenotypes that can be validated (per annotation column)?
sort(checks$checkable_count)
## blindness immunodeficiency reduced_fertility
## 2 6 6
## intellectual_disability physical_malformations sensory_impairments
## 7 60 152
## impaired_mobility cancer
## 253 456
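The printed rates and counts appear to be related by a simple division: each checkable count divided by the total number of annotation rows (the sum of the table(annot$pheno_count) output above) gives the corresponding checkable rate. This is an observation from the printed values, not a statement about the package internals.

```r
# Counts copied from the checks$checkable_count output above
checkable_count <- c(
  blindness = 2, immunodeficiency = 6, reduced_fertility = 6,
  intellectual_disability = 7, physical_malformations = 60,
  sensory_impairments = 152, impaired_mobility = 253, cancer = 456
)

# Total annotation rows, from the table(annot$pheno_count) output
n_rows <- 10378 + 478 + 93

# Rate = checkable phenotypes / all annotation rows
checkable_rate <- checkable_count / n_rows
signif(checkable_rate, 3)
```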
For the phenotypes that can be validated, what proportion have the expected values (per annotation column)?
sort(checks$true_pos_rate)
## impaired_mobility cancer physical_malformations
## 0.8853755 0.9385965 0.9666667
## sensory_impairments intellectual_disability blindness
## 0.9868421 1.0000000 1.0000000
## immunodeficiency reduced_fertility
## 1.0000000 1.0000000
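A true positive rate of this kind can be sketched as follows. The toy data, the hits vector, and the choice of “often” or “always” as the expected responses are assumptions made for illustration; they are not taken from the gpt_annot_check source.

```r
# Toy annotations for a single column (made-up values)
toy <- data.frame(
  hpo_id    = c("HP:1", "HP:2", "HP:3", "HP:4"),
  blindness = c("always", "often", "rarely", "never"),
  stringsAsFactors = FALSE
)

# Hypothetical query hits: phenotypes expected to involve blindness
hits <- c("HP:1", "HP:2", "HP:3")

# Assumed expectation: responses of at least "often"
expected <- c("often", "always")

# Restrict to checkable phenotypes, then compute the share that
# carry an expected response
checkable <- toy[toy$hpo_id %in% hits, ]
true_pos_rate <- mean(checkable$blindness %in% expected)

true_pos_rate
```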
checks$plot
The gpt_annot_codify function performs a series of steps to clean, filter, and quantify the responses.
coded <- HPOExplorer::gpt_annot_codify(annot = annot)
First, it codifies each response on a 0-4 scale:
code_dict = c(
"never"=0,
"rarely"=1,
"varies"=2,
"often"=3,
"always"=4
)
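As a small illustration, a named vector like this can be used as a lookup table. The responses vector below is made up, and how gpt_annot_codify treats unexpected responses is not shown here; in plain R, a label that is missing from the dictionary simply becomes NA.

```r
code_dict <- c("never" = 0, "rarely" = 1, "varies" = 2, "often" = 3, "always" = 4)

# Made-up responses, including one label that is not in the dictionary
responses <- c("always", "never", "often", "unknown")

# Look up each response by name; unmatched names yield NA
codes <- unname(code_dict[responses])
codes
```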
Then it multiplies those response values by the severity of their respective annotation column. This captures the fact that some annotations have more serious consequences than others (e.g. death >> reduced_fertility). In the tiers below, a lower tier number denotes a more severe annotation.
tiers_dict = list(
intellectual_disability=1,
death=1,
impaired_mobility=2,
physical_malformations=2,
blindness=3,
sensory_impairments=3,
immunodeficiency=3,
cancer=3,
reduced_fertility=4
)
Next, it takes the multiplied values across all columns and computes an average score per phenotype. This is then normalised by the theoretical maximum severity score, so that all phenotypes are on a 0-100 severity scale (where 100 is the most severe phenotype possible). This normalised score is added as a new column named “severity_score_gpt”.
Finally, the results are sorted by “severity_score_gpt” so that the most severe phenotypes are at the top of the table.
knitr::kable(head(coded$annot_weighted))
intellectual_disability | death | impaired_mobility | physical_malformations | blindness | sensory_impairments | immunodeficiency | cancer | reduced_fertility | hpo_id | severity_score_gpt | hpo_name |
---|---|---|---|---|---|---|---|---|---|---|---|
16 | 16 | 12 | 12 | 6 | 8 | 6 | 2 | 3 | HP:0007367 | 51.26582 | Atrophy/Degeneration affecting the central nervous system |
12 | 12 | 9 | 9 | 6 | 6 | 6 | 6 | 3 | HP:0000118 | 43.67089 | Phenotypic abnormality |
12 | 12 | 9 | 9 | 2 | 6 | 6 | 6 | 3 | HP:0011463 | 41.13924 | Childhood onset |
12 | 12 | 9 | 9 | 2 | 6 | 6 | 6 | 3 | HP:0011462 | 41.13924 | Young adult onset |
12 | 12 | 9 | 12 | 2 | 6 | 2 | 8 | 0 | HP:0009592 | 39.87342 | Astrocytoma |
16 | 12 | 9 | 9 | 6 | 6 | 2 | 2 | 1 | HP:0007369 | 39.87342 | Atrophy/Degeneration affecting the cerebrum |
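The codify, weight, normalise, and sort steps can be sketched on toy data as below. This is not the gpt_annot_codify implementation. The weight formula (maximum tier + 1 - tier) is inferred from the table above, where an “always” response in a tier 1 column yields 16, and the normalisation constant is a guess, so this sketch is not expected to reproduce the severity_score_gpt values shown. The phenotype IDs and responses are made up.

```r
code_dict <- c(never = 0, rarely = 1, varies = 2, often = 3, always = 4)
tiers_dict <- c(
  intellectual_disability = 1, death = 1,
  impaired_mobility = 2, physical_malformations = 2,
  blindness = 3, sensory_impairments = 3, immunodeficiency = 3, cancer = 3,
  reduced_fertility = 4
)

# Assumption: lower tiers weigh more, so tier 1 -> 4, ..., tier 4 -> 1
weights <- max(tiers_dict) + 1 - tiers_dict

# Two toy phenotypes with made-up responses (default "never")
responses <- matrix("never", nrow = 2, ncol = length(tiers_dict),
                    dimnames = list(c("HP:X", "HP:Y"), names(tiers_dict)))
responses["HP:X", c("death", "blindness")] <- c("always", "often")
responses["HP:Y", "reduced_fertility"] <- "always"

# Step 1: codify each response on the 0-4 scale
coded <- matrix(code_dict[as.vector(responses)], nrow = nrow(responses),
                dimnames = dimnames(responses))

# Step 2: multiply each column by its weight
weighted <- sweep(coded, 2, weights[colnames(coded)], `*`)

# Step 3: average per phenotype, normalised by the average of the
# largest possible weighted values (assumed theoretical maximum)
max_weighted <- max(code_dict) * weights
score <- 100 * rowMeans(weighted) / mean(max_weighted)

# Step 4: most severe phenotypes first
ranked <- sort(score, decreasing = TRUE)
ranked
```

Here the toy phenotype with an “always” response for death ranks above the one with an “always” response for reduced_fertility, because the former column carries the larger weight.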
downloadthis::download_this(.data = coded$annot_weighted,
output_name = "gpt_annot_plot_data",
csv2 = FALSE)
Now let’s summarise the annotation results with plots.
plts <- HPOExplorer::gpt_annot_plot(annot = annot)
## Loading required namespace: patchwork
## Getting absolute ontology level for 50 HPO IDs.
## ℹ All local files already up-to-date!
## Adding level-3 ancestor to each HPO ID.
## ℹ All local files already up-to-date!
## Removing remove descendants of: 'Clinical course'
## -'Sporadic'
## -'Multifactorial inheritance'
## -'Inheritance modifier'
## -'Phenotypic variability'
## Translating all phenotypes to HPO IDs.
## + Returning a dictionary of phenotypes (different order as input).
## Adding level-3 ancestor to each HPO ID.
## ℹ All local files already up-to-date!
knitr::kable(head(plts$data$dat1))
hpo_id | hpo_name | severity_score_gpt | variable | value |
---|---|---|---|---|
HP:0007367 | Atrophy/Degeneration affecting the central nervous system | 51.26582 | intellectual_disability | always |
HP:0000118 | Phenotypic abnormality | 43.67089 | intellectual_disability | often |
HP:0011463 | Childhood onset | 41.13924 | intellectual_disability | often |
HP:0011462 | Young adult onset | 41.13924 | intellectual_disability | often |
HP:0009592 | Astrocytoma | 39.87342 | intellectual_disability | often |
HP:0007369 | Atrophy/Degeneration affecting the cerebrum | 39.87342 | intellectual_disability | always |
Top 50 most severe phenotypes.
plts$gp0
Proportion of responses per annotation column.
plts$gp1
Responses vs. severity scores.
plts$gp2
Severity score distributions by HPO branch.
Let’s look at the distribution of GPT severity scores across all phenotypes, grouped by which branch of the HPO those phenotypes belong to.
In red, we show the mean severity score per HPO branch.
plts$gp3
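The per-branch means highlighted in the plot can be approximated with a simple grouped summary. The sketch below uses a toy data frame with made-up scores and a hypothetical branch column; the actual column names in the plot data may differ.

```r
# Toy data: made-up severity scores for two HPO branches
toy <- data.frame(
  branch = c("Abnormality of the eye", "Abnormality of the eye",
             "Neoplasm", "Neoplasm"),
  severity_score_gpt = c(10, 20, 30, 50),
  stringsAsFactors = FALSE
)

# Mean severity score per branch, highest first
branch_means <- tapply(toy$severity_score_gpt, toy$branch, mean)
branch_means <- sort(branch_means, decreasing = TRUE)
branch_means
```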
utils::sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.4 xfun_0.41 bslib_0.5.1
## [4] ggplot2_3.4.4 httr2_1.0.0 htmlwidgets_1.6.2
## [7] gh_1.4.0 lattice_0.22-5 tzdb_0.4.0
## [10] vctrs_0.6.4 tools_4.3.1 generics_0.1.3
## [13] parallel_4.3.1 curl_5.1.0 tibble_3.2.1
## [16] fansi_1.0.5 highr_0.10 pkgconfig_2.0.3
## [19] data.table_1.14.8 lifecycle_1.0.4 compiler_4.3.1
## [22] farver_2.1.1 stringr_1.5.1 munsell_0.5.0
## [25] htmltools_0.5.7 sass_0.4.7 yaml_2.3.7
## [28] lazyeval_0.2.2 plotly_4.10.3 crayon_1.5.2
## [31] pillar_1.9.0 jquerylib_0.1.4 tidyr_1.3.0
## [34] cachem_1.0.8 mime_0.12 network_1.18.1
## [37] tidyselect_1.2.0 digest_0.6.33 stringi_1.8.1
## [40] dplyr_1.1.4 purrr_1.0.2 labeling_0.4.3
## [43] fastmap_1.1.1 grid_4.3.1 colorspace_2.1-0
## [46] cli_3.6.1 HPOExplorer_0.99.12 magrittr_2.0.3
## [49] patchwork_1.1.3 base64enc_0.1-3 bsplus_0.1.4
## [52] piggyback_0.1.5 utf8_1.2.4 readr_2.1.4
## [55] withr_2.5.2 scales_1.2.1 ggnetwork_0.5.12
## [58] rappdirs_0.3.3 bit64_4.0.5 lubridate_1.9.3
## [61] timechange_0.2.0 rmarkdown_2.25 httr_1.4.7
## [64] gitcreds_0.1.2 bit_4.0.5 hms_1.1.3
## [67] coda_0.19-4 memoise_2.0.1.9000 evaluate_0.23
## [70] knitr_1.45 viridisLite_0.4.2 rlang_1.1.2
## [73] ontologyIndex_2.11 glue_1.6.2 downloadthis_0.3.3
## [76] rstudioapi_0.15.0 vroom_1.6.4 jsonlite_1.8.7
## [79] R6_2.5.1 statnet.common_4.9.0 fs_1.6.3