Import annotations

We have stored the GPT-generated annotations on GitHub Releases, and distribute them via the gpt_annot_read function.

annot <- HPOExplorer::gpt_annot_read()
## Reading cached RDS file: phenotype_to_genes.txt
## + Version: v2023-10-09
## Reading in GPT annotations for 10,648 phenotypes.
knitr::kable(head(annot))
| hpo_name | intellectual_disability | intellectual_disability_justification | death | death_justification | impaired_mobility | impaired_mobility_justification | physical_malformations | physical_malformations_justification | blindness | blindness_justification | sensory_impairments | sensory_impairments_justification | immunodeficiency | immunodeficiency_justification | cancer | cancer_justification | reduced_fertility | reduced_fertility_justification | congenital_onset | congenital_onset_justification | hpo_id | pheno_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1-2 toe complete cutaneous syndactyly | never | This condition affects the physical structure of the toes, not cognitive function. | rarely | Death could occur if complications arise during surgery. | often | The affected individual may have difficulty walking. | always | The toes are physically fused together. | never | This condition does not affect vision. | never | This condition does not affect sensory function. | never | This condition does not affect the immune system. | never | This condition does not increase the risk of cancer. | never | This condition does not affect fertility. | always | This condition is present at birth. | HP:0005767 | 1 |
| 1-2 toe syndactyly | never | No direct link between toe syndactyly and intellectual disability | never | Toe syndactyly does not directly cause death | rarely | May cause minor mobility issues in some cases | always | Toe syndactyly is a physical malformation | never | No direct link between toe syndactyly and blindness | never | No direct link between toe syndactyly and sensory impairments | never | No direct link between toe syndactyly and immunodeficiency | never | No direct link between toe syndactyly and cancer | never | No direct link between toe syndactyly and reduced fertility | always | Toe syndactyly is present at birth | HP:0010711 | 1 |
| 1-3 toe syndactyly | never | No relation to cognitive function. | never | No direct cause of death. | often | Can hinder normal walking or foot function. | always | Direct manifestation of the condition. | never | No relation to vision. | never | No sensory impairments other than the physical fusion. | never | Not related. | never | No relation to cancer. | never | No direct relation to fertility. | always | Is congenital in nature. | HP:0001459 | 1 |
| 1-4 finger syndactyly | never | No direct link between finger syndactyly and intellectual disability | never | Finger syndactyly does not directly cause death | often | Can cause significant mobility issues depending on severity | always | Finger syndactyly is a physical malformation | never | No direct link between finger syndactyly and blindness | never | No direct link between finger syndactyly and sensory impairments | never | No direct link between finger syndactyly and immunodeficiency | never | No direct link between finger syndactyly and cancer | never | No direct link between finger syndactyly and reduced fertility | always | Finger syndactyly is present at birth | HP:0010707 | 1 |
| 1-4 toe syndactyly | never | This condition affects the physical structure of the toes, not the brain. | never | This condition does not directly cause death. | often | The fusion of toes can lead to balance and walking issues. | always | Syndactyly is a physical malformation where two or more digits are fused together. | never | This condition affects the physical structure of the toes, not the eyes. | rarely | While the condition can potentially affect the nerves in the toes, it is not commonly associated with sensory impairments. | never | This condition does not affect the immune system. | never | This condition does not cause cancer. | never | This condition does not affect fertility. | always | Syndactyly is a congenital condition, meaning it is present at birth. | HP:0010712 | 1 |
| 1-5 finger complete cutaneous syndactyly | never | No direct correlation with intellectual capacities. | never | It is not a condition that directly induces mortality. | often | Syndactyly involves fusion of digits, often leading to mobility issues. | always | Syndactyly itself is a physical malformation. | never | There is no impact on the visual system. | rarely | While general sensory faculties are intact, fine touch discrimination may be rarely affected. | never | No impact on the immune system’s functionality. | never | No known association with cancer. | never | Fertility is not impacted by this physical malformation. | always | This is a congenital condition and is present from birth. | HP:0006088 | 1 |

Some phenotypes were annotated multiple times, though most were annotated only once:

table(annot$pheno_count)
## 
##     1     2     3 
## 10378   478    93

Validate

We can identify true positives by finding phenotypes that fall within specific branches of the HPO, where membership should guarantee a response of at least “often”. For example, blindness phenotypes should often be associated with blindness.

You can modify the searches with the search_hpo function.

query_hits <- HPOExplorer::search_hpo()
## Loading required namespace: piggyback
## ℹ All local files already up-to-date!
## Number of phenotype gits per query group:
##  - intellectual_disability: 7
##  - impaired_mobility: 294
##  - physical_malformations: 105
##  - blindness: 2
##  - sensory_impairments: 256
##  - immunodeficiency: 10
##  - cancer: 697
##  - reduced_fertility: 6
lapply(query_hits,head)
## $intellectual_disability
## [1] "HP:0001249" "HP:0001256" "HP:0002187" "HP:0002342" "HP:0006887"
## [6] "HP:0006889"
## 
## $impaired_mobility
## [1] "HP:0011442" "HP:0100022" "HP:0000473" "HP:0001336" "HP:0001337"
## [6] "HP:0002071"
## 
## $physical_malformations
## [1] "HP:0001305" "HP:0002308" "HP:0002390" "HP:0002408" "HP:0002438"
## [6] "HP:0002564"
## 
## $blindness
## [1] "HP:0000618" "HP:0007875"
## 
## $sensory_impairments
## [1] "HP:0000223" "HP:0000364" "HP:0000504" "HP:0003474" "HP:0004408"
## [6] "HP:0000224"
## 
## $immunodeficiency
## [1] "HP:0001400" "HP:0002721" "HP:0002755" "HP:0003553" "HP:0004430"
## [6] "HP:0005352"
## 
## $cancer
## [1] "HP:0002664" "HP:0002894" "HP:0002896" "HP:0002898" "HP:0003003"
## [6] "HP:0004375"
## 
## $reduced_fertility
## [1] "HP:0000144" "HP:0000868" "HP:0012041" "HP:0000789" "HP:0008222"
## [6] "HP:0003251"
checks <- HPOExplorer::gpt_annot_check(annot = annot, 
                                       query_hits = query_hits)
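
The idea behind this check can be sketched with toy data. The data frame, the hit list, and the `meets_expectation` helper below are hypothetical and are not the internals of gpt_annot_check; the sketch only assumes that a response counts as a true positive when it is “often” or “always”.

```r
# Toy annotations: one row per phenotype (the response values are made up).
toy_annot <- data.frame(
  hpo_id    = c("HP:0000618", "HP:0007875", "HP:0005767"),
  blindness = c("always", "often", "never"),
  stringsAsFactors = FALSE
)

# Phenotypes expected to involve blindness (IDs taken from the query output above).
toy_hits <- list(blindness = c("HP:0000618", "HP:0007875"))

# Response levels ordered from least to most frequent.
response_levels <- c("never", "rarely", "varies", "often", "always")

# Hypothetical helper: TRUE when a response is at least `minimum`.
meets_expectation <- function(responses, minimum = "often") {
  match(responses, response_levels) >= match(minimum, response_levels)
}

# Only phenotypes in the hit list can be validated for this column.
checkable <- toy_annot[toy_annot$hpo_id %in% toy_hits$blindness, ]

# Share of checkable phenotypes with the expected response.
true_pos_rate <- mean(meets_expectation(checkable$blindness))
true_pos_rate
```

Phenotypes outside the hit list are simply left out of the check, which is why only a small share of all annotations can be validated this way.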

Check consistency

When a phenotype has more than one set of annotations, how consistent are they (on a 0-1 scale)?

sort(unlist(checks$annot_consist))
##     sensory_impairments intellectual_disability                  cancer 
##               0.7703704               0.8259259               0.8370370 
##       reduced_fertility                   death       impaired_mobility 
##               0.8407407               0.8703704               0.8740741 
##  physical_malformations        immunodeficiency        congenital_onset 
##               0.8962963               0.9000000               0.9185185 
##               blindness             pheno_count 
##               0.9296296               1.0000000
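
One plausible reading of this score, not confirmed here against the package source, is the share of repeatedly annotated phenotypes whose responses all agree. In support of that reading, every value above is a multiple of 1/270, and the pheno_count table implies 478/2 + 93/3 = 270 phenotypes with more than one annotation. The toy data below are hypothetical.

```r
# Number of phenotypes with repeated annotations, implied by table(annot$pheno_count).
n_repeated <- 478 / 2 + 93 / 3

# Toy annotations: phenotype A and B are annotated more than once, C only once.
toy <- data.frame(
  hpo_id = c("A", "A", "B", "B", "B", "C"),
  cancer = c("never", "never", "often", "rarely", "often", "always"),
  stringsAsFactors = FALSE
)

# Keep only phenotypes that were annotated more than once.
n_per_id <- table(toy$hpo_id)
repeated <- toy[toy$hpo_id %in% names(n_per_id)[n_per_id > 1], ]

# TRUE when every annotation of a phenotype gives the same response.
agrees <- tapply(repeated$cancer, repeated$hpo_id,
                 function(x) length(unique(x)) == 1)

# Assumed definition: proportion of repeated phenotypes that fully agree.
consistency <- mean(agrees)
consistency
```

Under this definition, phenotype A agrees with itself and phenotype B does not, so the toy consistency is 0.5.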

Checkable rate

What proportion of annotated phenotypes can be validated (per annotation column)?

sort(checks$checkable_rate)
##               blindness        immunodeficiency       reduced_fertility 
##            0.0001826651            0.0005479953            0.0005479953 
## intellectual_disability  physical_malformations     sensory_impairments 
##            0.0006393278            0.0054799525            0.0138825464 
##       impaired_mobility                  cancer 
##            0.0231071331            0.0416476391

What is the absolute number of phenotypes that can be validated (per annotation column)?

sort(checks$checkable_count)
##               blindness        immunodeficiency       reduced_fertility 
##                       2                       6                       6 
## intellectual_disability  physical_malformations     sensory_impairments 
##                       7                      60                     152 
##       impaired_mobility                  cancer 
##                     253                     456
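
The two summaries are linked by simple arithmetic: dividing each count by the number of annotation rows reproduces the rates reported above. The row total is taken from the pheno_count table; that this is the denominator used internally is an inference from the numbers rather than something stated here.

```r
# Total annotation rows, from table(annot$pheno_count) above.
n_rows <- 10378 + 478 + 93

# Checkable counts as reported by checks$checkable_count.
checkable_count <- c(
  blindness = 2, immunodeficiency = 6, reduced_fertility = 6,
  intellectual_disability = 7, physical_malformations = 60,
  sensory_impairments = 152, impaired_mobility = 253, cancer = 456
)

# Rate = count / total rows.
checkable_rate <- checkable_count / n_rows
signif(checkable_rate, 3)
```

Even the best-covered column, cancer, can be validated for only about 4% of annotations.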

True positive rate

For the phenotypes that can be validated, what proportion have the expected values (per annotation column)?

sort(checks$true_pos_rate)
##       impaired_mobility                  cancer  physical_malformations 
##               0.8853755               0.9385965               0.9666667 
##     sensory_impairments intellectual_disability               blindness 
##               0.9868421               1.0000000               1.0000000 
##        immunodeficiency       reduced_fertility 
##               1.0000000               1.0000000
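
To put these rates in absolute terms, we can multiply them back by the checkable counts reported earlier. This sketch assumes the rate is simply true positives divided by checkable phenotypes; only the four columns with a rate below 1 are shown.

```r
# Checkable counts and true positive rates as reported above.
checkable_count <- c(impaired_mobility = 253, cancer = 456,
                     physical_malformations = 60, sensory_impairments = 152)
true_pos_rate <- c(impaired_mobility = 0.8853755, cancer = 0.9385965,
                   physical_malformations = 0.9666667, sensory_impairments = 0.9868421)

# Implied number of phenotypes with the expected response (assumed definition).
implied_true_pos <- round(true_pos_rate * checkable_count)

# Implied number of phenotypes that missed the expected response.
implied_misses <- checkable_count - implied_true_pos

rbind(implied_true_pos, implied_misses)
```

Seen this way, the lowest rate (impaired_mobility) corresponds to 29 of 253 checkable phenotypes falling short of “often”.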

Summary plot

checks$plot

Codify severity

The gpt_annot_codify function performs a series of steps to clean, filter, and quantify the responses.

coded <- HPOExplorer::gpt_annot_codify(annot = annot)

First, it codifies each response from 0-4:

code_dict <- c(
  "never" = 0,
  "rarely" = 1,
  "varies" = 2,
  "often" = 3,
  "always" = 4
)

Then it multiplies those response values by the severity weight of their respective annotation column. This captures the fact that some annotations have more serious consequences than others (e.g. death >> reduced_fertility).

tiers_dict <- list(
  intellectual_disability = 1,
  death = 1,
  impaired_mobility = 2,
  physical_malformations = 2,
  blindness = 3,
  sensory_impairments = 3,
  immunodeficiency = 3,
  cancer = 3,
  reduced_fertility = 4
)
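
Note that tier 1 holds the most severe columns, so the tier number itself cannot be the multiplier. The weighted table further down is consistent with a multiplier of 5 minus the tier: an “always” (4) in a tier-1 column appears as 16, and an “often” (3) in the tier-4 column appears as 3. This mapping is inferred from the output and is not necessarily how the package computes it.

```r
# Tiers as listed above (1 = most severe).
tiers <- c(
  intellectual_disability = 1, death = 1,
  impaired_mobility = 2, physical_malformations = 2,
  blindness = 3, sensory_impairments = 3, immunodeficiency = 3, cancer = 3,
  reduced_fertility = 4
)

# Assumed mapping from tier to multiplier: tier 1 -> 4, ..., tier 4 -> 1.
weights <- max(tiers) + 1 - tiers

# Weighted values for HP:0007367, copied from the first row of the coded table.
weighted_row <- c(16, 16, 12, 12, 6, 8, 6, 2, 3)

# Dividing by the weights should recover response codes between 0 and 4.
implied_codes <- weighted_row / weights
implied_codes
```

Every implied code is a whole number from 0 to 4, which is what we would expect if the assumed weights are right.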

Next, it takes the multiplied values across all columns and computes an average score per phenotype. This is then normalised by the theoretical maximum severity score, so that all phenotypes are on a 0-100 severity scale (where 100 is the most severe phenotype possible). This normalised score is added as a new column named “severity_score_gpt”.

Finally, the results are sorted by “severity_score_gpt” so that the most severe phenotypes are at the top of the table.
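
The steps above can be sketched end to end on toy data. The phenotypes, the three-column subset, and the weights are hypothetical, and the theoretical maximum is assumed to be an “always” response in every column. The sketch is not expected to reproduce the scores in the table below, because the exact normalising constant used by the package is not stated here.

```r
code_dict <- c(never = 0, rarely = 1, varies = 2, often = 3, always = 4)

# Hypothetical per-column weights (larger = more severe).
weights <- c(death = 4, impaired_mobility = 3, reduced_fertility = 1)

toy <- data.frame(
  hpo_name          = c("phenotype A", "phenotype B", "phenotype C"),
  death             = c("never", "always", "rarely"),
  impaired_mobility = c("often", "always", "never"),
  reduced_fertility = c("never", "always", "varies"),
  stringsAsFactors  = FALSE
)
cols <- names(weights)

# Step 1: recode each response to 0-4 (gives a phenotypes x columns matrix).
coded <- sapply(toy[cols], function(x) unname(code_dict[x]))

# Step 2: multiply each column by its weight.
weighted <- sweep(coded, 2, weights[cols], `*`)

# Step 3: average per phenotype and normalise to a 0-100 scale.
max_mean <- mean(max(code_dict) * weights)  # "always" in every column
toy$severity_score <- rowMeans(weighted) / max_mean * 100

# Step 4: most severe phenotypes first.
toy <- toy[order(toy$severity_score, decreasing = TRUE), ]
toy[, c("hpo_name", "severity_score")]
```

In this toy example, phenotype B answers “always” everywhere and therefore scores 100, while phenotype A scores 28.125 and phenotype C scores 18.75.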

knitr::kable(head(coded$annot_weighted))
| intellectual_disability | death | impaired_mobility | physical_malformations | blindness | sensory_impairments | immunodeficiency | cancer | reduced_fertility | hpo_id | severity_score_gpt | hpo_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 16 | 12 | 12 | 6 | 8 | 6 | 2 | 3 | HP:0007367 | 51.26582 | Atrophy/Degeneration affecting the central nervous system |
| 12 | 12 | 9 | 9 | 6 | 6 | 6 | 6 | 3 | HP:0000118 | 43.67089 | Phenotypic abnormality |
| 12 | 12 | 9 | 9 | 2 | 6 | 6 | 6 | 3 | HP:0011463 | 41.13924 | Childhood onset |
| 12 | 12 | 9 | 9 | 2 | 6 | 6 | 6 | 3 | HP:0011462 | 41.13924 | Young adult onset |
| 12 | 12 | 9 | 12 | 2 | 6 | 2 | 8 | 0 | HP:0009592 | 39.87342 | Astrocytoma |
| 16 | 12 | 9 | 9 | 6 | 6 | 2 | 2 | 1 | HP:0007369 | 39.87342 | Atrophy/Degeneration affecting the cerebrum |

Download coded data

downloadthis::download_this(.data = coded$annot_weighted, 
                            output_name = "gpt_annot_plot_data",
                            csv2 = FALSE)

Plot

Now let’s summarise the annotation results with plots.

plts <- HPOExplorer::gpt_annot_plot(annot = annot)
## Loading required namespace: patchwork
## Getting absolute ontology level for 50 HPO IDs.
## ℹ All local files already up-to-date!
## Adding level-3 ancestor to each HPO ID.
## ℹ All local files already up-to-date!
## Removing remove descendants of: 'Clinical course'
##  -'Sporadic'
##  -'Multifactorial inheritance'
##  -'Inheritance modifier'
##  -'Phenotypic variability'
## Translating all phenotypes to HPO IDs.
## + Returning a dictionary of phenotypes (different order as input).
## Adding level-3 ancestor to each HPO ID.
## ℹ All local files already up-to-date!

Raw data

knitr::kable(head(plts$data$dat1))
| hpo_id | hpo_name | severity_score_gpt | variable | value |
|---|---|---|---|---|
| HP:0007367 | Atrophy/Degeneration affecting the central nervous system | 51.26582 | intellectual_disability | always |
| HP:0000118 | Phenotypic abnormality | 43.67089 | intellectual_disability | often |
| HP:0011463 | Childhood onset | 41.13924 | intellectual_disability | often |
| HP:0011462 | Young adult onset | 41.13924 | intellectual_disability | often |
| HP:0009592 | Astrocytoma | 39.87342 | intellectual_disability | often |
| HP:0007369 | Atrophy/Degeneration affecting the cerebrum | 39.87342 | intellectual_disability | always |

Heatmap

Top 50 most severe phenotypes.

plts$gp0

Barplot

Proportion of responses per annotation column.

plts$gp1

Boxplots

Responses vs. severity scores.

plts$gp2

Histograms

Severity score distributions by HPO branch.

Let’s look at the distribution of GPT severity scores across all phenotypes, grouped by which branch of the HPO those phenotypes belong to.

In red, we show the mean severity score per HPO branch.

plts$gp3

Session info

utils::sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4         xfun_0.41            bslib_0.5.1         
##  [4] ggplot2_3.4.4        httr2_1.0.0          htmlwidgets_1.6.2   
##  [7] gh_1.4.0             lattice_0.22-5       tzdb_0.4.0          
## [10] vctrs_0.6.4          tools_4.3.1          generics_0.1.3      
## [13] parallel_4.3.1       curl_5.1.0           tibble_3.2.1        
## [16] fansi_1.0.5          highr_0.10           pkgconfig_2.0.3     
## [19] data.table_1.14.8    lifecycle_1.0.4      compiler_4.3.1      
## [22] farver_2.1.1         stringr_1.5.1        munsell_0.5.0       
## [25] htmltools_0.5.7      sass_0.4.7           yaml_2.3.7          
## [28] lazyeval_0.2.2       plotly_4.10.3        crayon_1.5.2        
## [31] pillar_1.9.0         jquerylib_0.1.4      tidyr_1.3.0         
## [34] cachem_1.0.8         mime_0.12            network_1.18.1      
## [37] tidyselect_1.2.0     digest_0.6.33        stringi_1.8.1       
## [40] dplyr_1.1.4          purrr_1.0.2          labeling_0.4.3      
## [43] fastmap_1.1.1        grid_4.3.1           colorspace_2.1-0    
## [46] cli_3.6.1            HPOExplorer_0.99.12  magrittr_2.0.3      
## [49] patchwork_1.1.3      base64enc_0.1-3      bsplus_0.1.4        
## [52] piggyback_0.1.5      utf8_1.2.4           readr_2.1.4         
## [55] withr_2.5.2          scales_1.2.1         ggnetwork_0.5.12    
## [58] rappdirs_0.3.3       bit64_4.0.5          lubridate_1.9.3     
## [61] timechange_0.2.0     rmarkdown_2.25       httr_1.4.7          
## [64] gitcreds_0.1.2       bit_4.0.5            hms_1.1.3           
## [67] coda_0.19-4          memoise_2.0.1.9000   evaluate_0.23       
## [70] knitr_1.45           viridisLite_0.4.2    rlang_1.1.2         
## [73] ontologyIndex_2.11   glue_1.6.2           downloadthis_0.3.3  
## [76] rstudioapi_0.15.0    vroom_1.6.4          jsonlite_1.8.7      
## [79] R6_2.5.1             statnet.common_4.9.0 fs_1.6.3