Prioritise target genes using the following multi-level filtering procedure:
Disease-level: keep_deaths
:
Keep only diseases with a certain age of death.
Disease-level: severity_threshold_max
:
Keep only diseases annotated as a certain degree of severity or greater
(filters on maximum severity per disease).
Phenotype-level: prune_ancestors
:
Remove redundant ancestral phenotypes when at least one of their
descendants is already present.
Phenotype-level: keep_descendants
:
Keep only phenotypes belonging to a certain branch of the HPO,
as defined by an ancestor term.
Phenotype-level: keep_ont_levels
:
Keep only phenotypes at certain absolute ontology levels within the HPO.
Phenotype-level: pheno_ndiseases_threshold
:
The maximum number of diseases each phenotype can be associated with.
Phenotype-level: keep_tiers
:
Keep only phenotypes with high severity Tiers.
Phenotype-level: severity_threshold
:
Keep only phenotypes with mean Severity equal to or below the threshold.
Phenotype-level: gpt_filters
:
Keep only phenotypes with certain GPT annotations in specific
severity metrics.
Phenotype-level: severity_score_gpt_threshold
:
Keep only phenotypes with a minimum GPT severity score.
Phenotype-level: info_content_threshold
:
Keep only phenotypes with a minimum information content score
(computed from the HPO).
Symptom-level: pheno_frequency_threshold
:
Keep only phenotypes with mean frequency equal to or above the threshold
(i.e. how frequently a phenotype is associated with any diseases in
which it occurs).
Symptom-level: keep_onsets
:
Keep only symptoms with a certain age of onset.
Symptom-level: symptom_p_threshold
:
Uncorrected p-value threshold to filter cell type-symptom associations by.
Symptom-level: symptom_intersection_threshold
:
Minimum proportion of genes overlapping between a symptom gene list
(phenotype-associated genes in the context of a particular disease)
and the phenotype-cell type association driver genes.
Cell type-level: q_threshold
:
Keep only cell type-phenotype association results at q<=0.05.
Cell type-level: effect_threshold
:
Keep only cell type-phenotype association results at effect size>=1.
Cell type-level: keep_celltypes
:
Keep only terminally differentiated cell types.
Gene-level: keep_chr
:
Remove genes on non-standard chromosomes.
Gene-level: evidence_score_threshold
:
Remove genes that are below an aggregate phenotype-gene
evidence score threshold.
Gene-level: gene_size
:
Keep only genes within a given size range (e.g. <4.3kb in length).
Gene-level: add_driver_genes
:
Keep only genes that are driving the association with a given phenotype
(inferred by the intersection of phenotype-associated genes and genes with
high-specificity quantiles in the target cell type).
Gene-level: keep_biotypes
:
Keep only genes belonging to certain biotypes.
Gene-level: gene_frequency_threshold
:
Keep only genes at or above a certain mean frequency threshold
(i.e. how frequently a gene is associated with a given phenotype
when observed within a disease).
Gene-level: keep_specificity_quantiles
:
Keep only genes in top specificity quantiles
from the cell type dataset (CTD).
Gene-level: keep_mean_exp_quantiles
:
Keep only genes in top mean expression quantiles
from the cell type dataset (CTD).
Gene-level: symptom_gene_overlap
:
Ensure that genes nominated at the phenotype-level also
appear in the genes overlapping at the cell type-specific symptom-level.
All levels: sort_cols
:
Sort candidate targets by one or more columns
(e.g. "severity_score_gpt", "q").
All levels: top_n
:
Only return the top N targets per variable group
(specified with the "group_vars" argument).
For example, setting "group_vars" to "hpo_id" and "top_n" to 1 would
only return one target (row) per phenotype ID after sorting.
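The interaction between sorting, "group_vars" and "top_n" described above can be sketched in base R. This is only an illustration on a toy table: the helper keep_top_n, the gene names and the q-values are hypothetical, and the package's internal implementation may differ.

```r
# Toy candidate targets (hypothetical values, not package output).
targets <- data.frame(
  hpo_id = c("HP:0000001", "HP:0000001", "HP:0000002", "HP:0000002"),
  gene_symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  q = c(0.010, 0.001, 0.030, 0.020),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sort rows by one column (ascending),
# then keep the first `top_n` rows within each group.
keep_top_n <- function(df, sort_col, group_var, top_n) {
  df <- df[order(df[[sort_col]]), , drop = FALSE]
  # Position of each row within its group, following the sorted order.
  rank_in_group <- ave(seq_len(nrow(df)), df[[group_var]], FUN = seq_along)
  df[rank_in_group <= top_n, , drop = FALSE]
}

# One row per phenotype ID: the row with the smallest q in each group.
top_targets <- keep_top_n(targets, sort_col = "q",
                          group_var = "hpo_id", top_n = 1)
print(top_targets)
```

With top_n = 1 the result holds a single row per hpo_id, which is why the sort order matters: it decides which row survives in each group.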
prioritise_targets(
results = load_example_results(),
ctd_list = load_example_ctd(c("ctd_DescartesHuman.rds", "ctd_HumanCellLandscape.rds"),
multi_dataset = TRUE),
phenotype_to_genes = HPOExplorer::load_phenotype_to_genes(),
hpo = HPOExplorer::get_hpo(),
keep_deaths = HPOExplorer::list_deaths(exclude = c("Miscarriage", "Stillbirth",
"Prenatal death"), include_na = TRUE),
keep_descendants = c("Phenotypic abnormality"),
keep_ont_levels = NULL,
pheno_ndiseases_threshold = NULL,
gpt_filters = NULL,
severity_score_gpt_threshold = 20,
keep_tiers = NULL,
severity_threshold_max = NULL,
info_content_threshold = 8,
run_prune_ancestors = TRUE,
severity_threshold = NULL,
pheno_frequency_threshold = NULL,
keep_onsets = HPOExplorer::list_onsets(include_na = TRUE),
effect_var = "logFC",
q_threshold = 0.05,
effect_threshold = 1,
symptom_intersection_threshold = 0.25,
keep_celltypes = NULL,
evidence_score_threshold = 15,
keep_chr = c(seq(22), "X", "Y"),
gene_size = list(min = 0, max = Inf),
gene_frequency_threshold = NULL,
keep_biotypes = NULL,
keep_specificity_quantiles = seq(30, 40),
keep_mean_exp_quantiles = seq(30, 40),
sort_cols = c(severity_score_gpt = -1, q = 1, logFC = -1, specificity = -1,
mean_exp = -1, pheno_freq_mean = -1, gene_freq_mean = -1, width = 1),
top_n = NULL,
group_vars = c("hpo_id"),
return_report = TRUE,
verbose = TRUE
)
The cell type-phenotype enrichment results generated by gen_results and merged together with merge_results.
A named list of CellTypeDataset objects each created with generate_celltype_data.
Output of load_phenotype_to_genes mapping phenotypes to gene annotations.
Human Phenotype Ontology object, loaded from get_ontology.
The age of death associated with each HPO ID to keep. If >1 age of death is associated with the term, only the earliest age is considered. See add_death for details.
Terms whose descendants should be kept (including themselves). Set to NULL (default) to skip this filtering step.
Only keep phenotypes at certain absolute ontology levels. See add_ont_lvl for details.
Filter phenotypes by the maximum number of diseases they are associated with.
A named list of filters to apply to the GPT annotations.
The minimum GPT severity score that a phenotype can have across any disease.
Tiers from hpo_tiers to keep. Include NA if you wish to retain phenotypes that do not have any Tier assignment.
The max severity score that a phenotype can have across any disease.
Minimum phenotype information content threshold.
Prune redundant ancestral terms if any of their descendants are present. Passes to prune_ancestors.
Only keep phenotypes with a mean severity score (averaged across multiple associated diseases) below the set threshold. The severity score ranges from 1-4 where 1 is the MOST severe. Include NA if you wish to retain phenotypes that do not have any severity score.
Only keep phenotypes with frequency above the set threshold. Frequency ranges from 0-100 where 100 is a phenotype that occurs 100% of the time in all associated diseases. Include NA if you wish to retain phenotypes that do not have any frequency data. See add_pheno_frequency for details.
The age of onset associated with each HPO ID to keep. If >1 age of onset is associated with the term, only the earliest age is considered. See add_onset for details.
Name of the effect size column in the results.
The q value threshold to subset the results by.
The minimum fold change in specific expression to subset the results by.
Minimum proportion of genes overlapping between a symptom gene list (phenotype-associated genes in the context of a particular disease) and the phenotype-cell type association driver genes.
Cell types to keep.
The minimum threshold of mean evidence scores of each gene-phenotype association to keep.
Chromosomes to keep.
Min/max gene size (important for therapeutics design).
Only keep genes with frequency above the set threshold. Frequency ranges from 0-100 where 100 is a gene that occurs 100% of the time in a given phenotype. Include NA if you wish to retain genes that do not have any frequency data. See add_gene_frequency for details.
Which gene biotypes to keep (e.g. "protein_coding", "processed_transcript", "snRNA", "lincRNA", "snoRNA", "IG_C_gene").
Which cell type specificity quantiles to keep (max quantile is 40).
Which cell type mean expression quantiles to keep (max quantile is 40).
How to sort the rows using setorderv. names(sort_cols) will be supplied to the cols= argument and values will be supplied to the order= argument.
Top N genes to keep when grouping by group_vars.
Columns to group by when selecting top_n genes.
If TRUE, will return a named list containing a report that shows the number of phenotypes/celltypes/genes remaining after each filtering step.
Print messages.
A data.table of the prioritised phenotype- and cell type-specific gene targets.
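As a rough illustration of how a named vector such as sort_cols maps onto column names and sort directions, the sketch below reproduces the idea in base R on a toy table. The gene names and values are hypothetical, and the base R ordering is only a stand-in for the setorderv call mentioned above, not the package's own code.

```r
# Toy table (hypothetical values).
dt <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_C"),
  severity_score_gpt = c(30, 30, 10),
  q = c(0.04, 0.01, 0.02),
  stringsAsFactors = FALSE
)

# Named vector in the same shape as the sort_cols argument:
# names are columns, values are directions (1 = ascending, -1 = descending).
sort_cols <- c(severity_score_gpt = -1, q = 1)

# Base R stand-in for:
#   data.table::setorderv(dt, cols = names(sort_cols), order = unname(sort_cols))
# Multiplying a numeric column by -1 reverses its sort direction.
keys <- Map(function(col, dir) dir * dt[[col]], names(sort_cols), sort_cols)
sorted <- dt[do.call(order, unname(keys)), ]
print(sorted)
```

Here rows are ordered by descending severity_score_gpt first, and ties are broken by ascending q. The multiplication trick only works for numeric columns, which is an assumption of this sketch.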
Term key:
Disease:
A disease defined in the databases
OMIM, DECIPHER and/or Orphanet.
Phenotype: A clinical feature associated with one or more diseases.
Symptom:
A phenotype within the context of a particular disease.
Within a given phenotype, there may be multiple symptoms with
partially overlapping genetic mechanisms.
Association:
A cell type-specific enrichment test result conducted
at the disease-level, phenotype-level, or symptom-level.
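To make the phenotype/symptom distinction concrete, the following toy sketch treats one phenotype observed in two diseases as two symptoms, each with its own gene list, and computes the proportion of each list that overlaps a set of driver genes. All gene and disease names are hypothetical, and using the symptom gene list as the denominator is an assumption; the exact definition used by symptom_intersection_threshold may differ.

```r
# One phenotype observed in two diseases gives two symptoms,
# each with its own (partially overlapping) gene list. Names are hypothetical.
symptom_genes <- list(
  disease_1 = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  disease_2 = c("GENE_C", "GENE_E", "GENE_F", "GENE_G", "GENE_H")
)

# Driver genes of the phenotype-cell type association (hypothetical).
driver_genes <- c("GENE_A", "GENE_B", "GENE_C")

# Assumption: the proportion is taken relative to the symptom gene list.
intersection_prop <- vapply(
  symptom_genes,
  function(genes) length(intersect(genes, driver_genes)) / length(genes),
  numeric(1)
)

# Keep symptoms whose overlap meets the threshold (0.25 in the usage above).
threshold <- 0.25
kept_symptoms <- names(intersection_prop)[intersection_prop >= threshold]
print(intersection_prop)
print(kept_symptoms)
```

In this toy case the two symptoms share GENE_C, but only the first has enough overlap with the driver genes to pass the threshold.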
results = load_example_results()[q<0.05]
out <- prioritise_targets(results=results)
#> Loading ctd_DescartesHuman.rds
#> Loading ctd_HumanCellLandscape.rds
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> Prioritising gene targets.
#> Adding logFC column.
#> Adding HPO names.
#> Translating ontology terms to names.
#> Adding term definitions.
#> Adding information_content scores.
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> Adding genes and disease IDs.
#> Prioritised targets: step='start'
#> - Rows: 10,938,355
#> - Phenotypes: 9,575
#> - Diseases: 8,805
#> - Cell types: 201
#> - Genes: 5,092
#> Filtering @ q-value <= 0.05
#> Prioritised targets: step='q_threshold'
#> - Rows: 10,938,355
#> - Phenotypes: 9,575
#> - Diseases: 8,805
#> - Cell types: 201
#> - Genes: 5,092
#> Filtering @ logFC >= 1
#> Prioritised targets: step='effect_threshold'
#> - Rows: 16,235
#> - Phenotypes: 637
#> - Diseases: 2,664
#> - Cell types: 130
#> - Genes: 2,364
#> Annotating phenos with AgeOfDeath.
#> Translating ontology terms to ids.
#> Prioritised targets: step='keep_deaths'
#> - Rows: 16,209
#> - Phenotypes: 633
#> - Diseases: 2,657
#> - Cell types: 130
#> - Genes: 2,362
#> Adding level-2 ancestor to each HPO ID.
#> Adding ancestor metadata.
#> Ancestor metadata already present. Use force_new=TRUE to overwrite.
#> Translating ontology terms to ids.
#> Keeping descendants of 1 term(s).
#> 18,379 terms remain after filtering.
#> 16,198 associations remain after filtering.
#> Prioritised targets: step='keep_descendants'
#> - Rows: 16,198
#> - Phenotypes: 627
#> - Diseases: 2,653
#> - Cell types: 130
#> - Genes: 2,362
#> Getting absolute ontology level for 19,025 IDs.
#> Prioritised targets: step='keep_ont_levels'
#> - Rows: 16,198
#> - Phenotypes: 627
#> - Diseases: 2,653
#> - Cell types: 130
#> - Genes: 2,362
#> Translating ontology terms to ids.
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> 383 phenotypes do not have matching HPO IDs.
#> Reading in GPT annotations for 16,753 phenotypes.
#> Prioritised targets: step='gpt_filters'
#> - Rows: 16,198
#> - Phenotypes: 627
#> - Diseases: 2,653
#> - Cell types: 130
#> - Genes: 2,362
#> Prioritised targets: step='severity_score_gpt_threshold'
#> - Rows: 5,157
#> - Phenotypes: 258
#> - Diseases: 1,459
#> - Cell types: 89
#> - Genes: 1,457
#> Prioritised targets: step='info_content_threshold'
#> - Rows: 3,605
#> - Phenotypes: 219
#> - Diseases: 1,040
#> - Cell types: 88
#> - Genes: 1,172
#> Annotating phenos with onset.
#> Translating ontology terms to ids.
#> Prioritised targets: step='keep_onsets'
#> - Rows: 3,605
#> - Phenotypes: 219
#> - Diseases: 1,040
#> - Cell types: 88
#> - Genes: 1,172
#> Annotating phenos with Tiers.
#> Prioritised targets: step='keep_tiers'
#> - Rows: 3,605
#> - Phenotypes: 219
#> - Diseases: 1,040
#> - Cell types: 88
#> - Genes: 1,172
#> Annotating phenos with modifiers
#> Prioritised targets: step='severity_threshold'
#> - Rows: 3,605
#> - Phenotypes: 219
#> - Diseases: 1,040
#> - Cell types: 88
#> - Genes: 1,172
#> Prioritised targets: step='severity_threshold_max'
#> - Rows: 3,605
#> - Phenotypes: 219
#> - Diseases: 1,040
#> - Cell types: 88
#> - Genes: 1,172
#> Annotating phenos with n_diseases
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> Reading cached RDS file: genes_to_phenotype.txt
#> + Version: v2024-12-12
#> Reading cached RDS file: phenotype.hpoa
#> + Version: v2024-12-12
#> Prioritised targets: step='pheno_ndiseases_threshold'
#> - Rows: 3,605
#> - Phenotypes: 219
#> - Diseases: 1,040
#> - Cell types: 88
#> - Genes: 1,172
#> Annotating phenotype frequencies.
#> Prioritised targets: step='pheno_frequency_threshold'
#> - Rows: 3,614
#> - Phenotypes: 219
#> - Diseases: 1,040
#> - Cell types: 88
#> - Genes: 1,172
#> Prioritised targets: step='keep_celltypes'
#> - Rows: 3,614
#> - Phenotypes: 219
#> - Diseases: 1,040
#> - Cell types: 88
#> - Genes: 1,172
#> Converting phenos to GRanges.
#> Loading required namespace: ensembldb
#> Gathering metadata for 1172 unique genes.
#> Loading required namespace: EnsDb.Hsapiens.v75
#> Prioritised targets: step='symptom_gene_overlap'
#> - Rows: 3,235
#> - Phenotypes: 219
#> - Diseases: 997
#> - Cell types: 88
#> - Genes: 1,097
#> Filtering by keep_chr.
#> Prioritised targets: step='keep_chr'
#> - Rows: 3,235
#> - Phenotypes: 219
#> - Diseases: 997
#> - Cell types: 88
#> - Genes: 1,097
#> Filtering by gene-disease association evidence.
#> Annotating gene-disease associations with Evidence Score
#> Gathering data from GenCC.
#> Importing cached file.
#> Evidence scores for:
#> - 10514 diseases
#> - 5171 genes
#> + Version: 2024-12-19
#> Prioritised targets: step='evidence_score_threshold'
#> - Rows: 288
#> - Phenotypes: 70
#> - Diseases: 138
#> - Cell types: 36
#> - Genes: 139
#> Filtering by gene size.
#> 139 / 139 genes kept.
#> Prioritised targets: step='gene_size'
#> - Rows: 288
#> - Phenotypes: 70
#> - Diseases: 138
#> - Cell types: 36
#> - Genes: 139
#> Prioritised targets: step='keep_biotypes'
#> - Rows: 288
#> - Phenotypes: 70
#> - Diseases: 138
#> - Cell types: 36
#> - Genes: 139
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> Prioritised targets: step='add_driver_genes'
#> - Rows: 202
#> - Phenotypes: 65
#> - Diseases: 99
#> - Cell types: 33
#> - Genes: 97
#> Adding symptom-level results.
#> Subsetting results by q_threshold and effect.
#> 202 associations remain after filtering.
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> Dropping subthreshold symptoms.
#> Prioritised targets: step='symptom_intersection_threshold'
#> - Rows: 202
#> - Phenotypes: 65
#> - Diseases: 99
#> - Cell types: 33
#> - Genes: 97
#> Annotating gene frequencies.
#> Reading cached RDS file: genes_to_phenotype.txt
#> + Version: v2024-12-12
#> Prioritised targets: step='gene_frequency_threshold'
#> - Rows: 246
#> - Phenotypes: 65
#> - Diseases: 99
#> - Cell types: 33
#> - Genes: 97
#> Pruning ancestors.
#> 58 / 65 terms were kept after pruning.
#> Prioritised targets: step='prune_ancestors'
#> - Rows: 215
#> - Phenotypes: 58
#> - Diseases: 99
#> - Cell types: 33
#> - Genes: 97
#> Sorting rows.
#> Prioritised targets: step='end'
#> - Rows: 215
#> - Phenotypes: 58
#> - Diseases: 99
#> - Cell types: 33
#> - Genes: 97
#> Adding disease_name and disease_description.
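The effect of each filter can be summarised from the row counts printed above. The sketch below copies the counts for a handful of steps from the log by hand and computes the change between the listed steps and the share of starting rows that remain. It does not read the object returned with return_report = TRUE, whose structure is not shown here.

```r
# Row counts copied from the printed log above for a few selected steps.
rows <- c(
  start = 10938355,
  effect_threshold = 16235,
  severity_score_gpt_threshold = 5157,
  info_content_threshold = 3605,
  evidence_score_threshold = 288,
  add_driver_genes = 202,
  end = 215
)

step_summary <- data.frame(
  step = names(rows),
  rows = unname(rows),
  # Change relative to the previous listed step (NA for the first step).
  rows_change = c(NA, diff(unname(rows))),
  # Percentage of the starting rows still present at each step.
  pct_of_start = round(100 * unname(rows) / rows[["start"]], 4),
  stringsAsFactors = FALSE
)
print(step_summary)
```

Because only some steps are listed, rows_change spans the steps in between as well. The counts do not always decrease: the log shows the row count rising at a few steps, so the last change here is positive.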