Prioritise target genes using the following multi-step filtering procedure:
Disease-level: keep_deaths: Keep only diseases with a certain age of death.
Phenotype-level: remove_descendants: Remove phenotypes belonging to a certain branch of the HPO, as defined by an ancestor term.
Phenotype-level: keep_ont_levels: Keep only phenotypes at certain absolute ontology levels within the HPO.
Phenotype-level: pheno_ndiseases_threshold: The maximum number of diseases each phenotype can be associated with.
Phenotype-level: keep_tiers: Keep only phenotypes with high severity Tiers.
Phenotype-level: severity_threshold: Keep only phenotypes with mean Severity equal to or below the threshold.
Symptom-level: pheno_frequency_threshold: Keep only phenotypes with mean frequency equal to or above the threshold (i.e. how frequently a phenotype is associated with any diseases in which it occurs).
Symptom-level: keep_onsets: Keep only symptoms with a certain age of onset.
Symptom-level: symptom_p_threshold: Uncorrected p-value threshold to filter cell type-symptom associations by.
Symptom-level: symptom_intersection_size_threshold: Minimum number of genes overlapping between a symptom gene list (phenotype-associated genes in the context of a particular disease) and the cell type (genes in top 10/40 specificity quantiles).
Cell type-level: q_threshold: Keep only cell type-phenotype association results at or below the q-value threshold (default: q<=0.05).
Cell type-level: fold_threshold: Keep only cell type-phenotype association results at or above the fold-change threshold (default: fold_change>=2).
Cell type-level: keep_celltypes: Keep only terminally differentiated cell types.
Gene-level: keep_seqnames: Remove genes on non-standard chromosomes.
Gene-level: gene_size: Keep only genes within a given size range (e.g. <4.3kb in length; the default applies no size limit).
Gene-level: keep_biotypes: Keep only genes belonging to certain biotypes.
Gene-level: gene_frequency_threshold: Keep only genes at or above a certain mean frequency threshold (i.e. how frequently a gene is associated with a given phenotype when observed within a disease).
Gene-level: keep_specificity_quantiles: Keep only genes in top specificity quantiles from the cell type dataset (ctd).
Gene-level: keep_mean_exp_quantiles: Keep only genes in top mean expression quantiles from the cell type dataset (ctd).
Gene-level: symptom_gene_overlap: Ensure that genes nominated at the phenotype-level also appear in the genes overlapping at the cell type-specific symptom-level.
All levels: top_n: Sort candidate targets by a preferred order of metrics and only return the top N targets per cell type-phenotype combination.
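The procedure above is a cascade of filters, each applied to the rows that survived the previous one. The following is a minimal base R sketch of that idea, not the package's implementation: the toy table, its values, the placeholder IDs and the `tier` column name are invented for illustration, and only three of the steps are shown.

```r
# Toy association table. Column names loosely follow those used on this page;
# all values and IDs are invented for illustration.
toy <- data.frame(
  hpo_id      = c("HP:A", "HP:A", "HP:B", "HP:C", "HP:D"),
  CellType    = c("A", "B", "A", "B", "A"),
  q           = c(0.01, 0.20, 0.03, 0.04, 0.001),
  fold_change = c(3.0, 5.0, 1.5, 2.5, 4.0),
  tier        = c(1, 2, 1, NA, 4),   # hypothetical column name
  stringsAsFactors = FALSE
)

# Each step takes a table and returns only the surviving rows.
steps <- list(
  q_threshold    = function(d) d[d$q <= 0.05, ],
  fold_threshold = function(d) d[d$fold_change >= 2, ],
  # %in% matches NA against NA, so listing NA keeps untiered phenotypes.
  keep_tiers     = function(d) d[d$tier %in% c(1, 2, NA), ]
)

# Apply the steps in order and record how many rows remain after each one.
report <- data.frame(step = "start", rows = nrow(toy))
filtered <- toy
for (nm in names(steps)) {
  filtered <- steps[[nm]](filtered)
  report <- rbind(report, data.frame(step = nm, rows = nrow(filtered)))
}
print(report)
```

Recording the row count after every step is what makes it possible to see which filter is responsible for most of the attrition.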
prioritise_targets(
  results = load_example_results(),
  ctd = load_example_ctd(),
  annotLevel = 1,
  phenotype_to_genes = HPOExplorer::load_phenotype_to_genes(),
  hpo = HPOExplorer::get_hpo(),
  keep_deaths = HPOExplorer::list_deaths(
    exclude = c("Miscarriage", "Stillbirth", "Prenatal death"),
    include_na = FALSE
  ),
  remove_descendants = c("Clinical course"),
  keep_ont_levels = NULL,
  pheno_ndiseases_threshold = NULL,
  keep_tiers = c(1, 2, NA),
  severity_threshold_max = NULL,
  severity_threshold = c(2, NA),
  pheno_frequency_threshold = NULL,
  keep_onsets = HPOExplorer::list_onsets(
    exclude = c("Antenatal", "Fetal", "Congenital"),
    include_na = TRUE
  ),
  q_threshold = 0.05,
  fold_threshold = 2,
  symptom_p_threshold = NULL,
  symptom_intersection_size_threshold = 1,
  keep_celltypes = terminal_celltypes()$CellType,
  keep_evidence = seq(3, 6),
  keep_seqnames = c(seq(22), "X", "Y"),
  gene_size = list(min = 0, max = Inf),
  gene_frequency_threshold = NULL,
  keep_biotypes = NULL,
  keep_specificity_quantiles = NULL,
  keep_mean_exp_quantiles = seq(1, 40),
  symptom_gene_overlap = TRUE,
  sort_cols = c(tier_merge = 1, Severity_score_mean = 1, q = 1, fold_change = -1,
    specificity = -1, mean_exp = -1, pheno_freq_mean = -1, gene_freq_mean = -1,
    width = 1),
  top_n = NULL,
  group_vars = c("disease_id", "hpo_id", "CellType"),
  return_report = TRUE,
  verbose = TRUE
)
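Several defaults in the signature above are R expressions rather than literal values. Evaluating them on their own, as in this small base R sketch, shows what would be passed in. The gene widths and the range check at the end are invented for illustration and are only an assumption about how a min/max bound might be applied.

```r
# Mixing integers and strings coerces the whole vector to character,
# so the default chromosome names are "1".."22", "X" and "Y".
keep_seqnames <- c(seq(22), "X", "Y")
print(class(keep_seqnames))
print(length(keep_seqnames))

# Evidence scores and expression quantiles are plain numeric ranges.
keep_evidence <- seq(3, 6)
keep_mean_exp_quantiles <- seq(1, 40)

# gene_size is a named list of bounds; max = Inf means no upper limit.
gene_size <- list(min = 0, max = Inf)

# Hypothetical gene widths (in base pairs), checked against the bounds.
widths <- c(1200, 4300, 90000)
in_range <- widths >= gene_size$min & widths <= gene_size$max
print(in_range)
```

With the default bounds every gene passes. Setting `max = 4300` in this sketch would drop the third gene.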
results: The cell type-phenotype enrichment results generated by gen_results and merged together with merge_results.
ctd: Cell Type Data List generated using generate_celltype_data.
annotLevel: An integer indicating which level of sct_data to analyse (Default: 1).
phenotype_to_genes: Output of load_phenotype_to_genes mapping phenotypes to gene annotations.
hpo: Human Phenotype Ontology object, loaded from ontologyIndex.
keep_deaths: The age of death associated with each HPO ID to keep. If >1 age of death is associated with the term, only the earliest age is considered. See add_death for details.
remove_descendants: Remove HPO terms that are descendants of a given ancestral HPO term. Ancestral terms can be provided as a character vector of phenotype names (e.g. c("Clinical course")), HPO IDs (e.g. "HP:0031797") or a mixture of the two. See add_ancestor for details.
keep_ont_levels: Only keep phenotypes at certain absolute ontology levels. See add_ont_lvl for details.
pheno_ndiseases_threshold: Filter phenotypes by the maximum number of diseases they are associated with.
keep_tiers: Tiers from hpo_tiers to keep. Include NA if you wish to retain phenotypes that do not have any Tier assignment.
severity_threshold_max: The max severity score that a phenotype can have across any disease.
severity_threshold: Only keep phenotypes with a mean severity score (averaged across multiple associated diseases) below the set threshold. The severity score ranges from 1 to 4, where 1 is the most severe. Include NA if you wish to retain phenotypes that do not have any severity score.
pheno_frequency_threshold: Only keep phenotypes with frequency above the set threshold. Frequency ranges from 0-100 where 100 is a phenotype that occurs 100% of the time in all associated diseases. Include NA if you wish to retain phenotypes that do not have any frequency data. See add_pheno_frequency for details.
keep_onsets: The age of onset associated with each HPO ID to keep. If >1 age of onset is associated with the term, only the earliest age is considered. See add_onset for details.
q_threshold: The q value threshold to subset the results by.
fold_threshold: The minimum fold change in specific expression to subset the results by.
symptom_p_threshold: The p-value threshold of celltype-symptom enrichment results (using gen_overlap). Here, "symptoms" are defined as the presentation of a phenotype in the context of a particular disease. In other words: phenotype (hpo_id) + disease (disease_id) = symptom (hpo_id.disease_id).
symptom_intersection_size_threshold: The minimum number of intersecting genes between a symptom and a celltype to consider it a significant enrichment. Refers to the result from gen_overlap.
keep_celltypes: Cell types to keep.
keep_evidence: The evidence scores of each gene-disease association to keep.
keep_seqnames: Chromosomes to keep.
gene_size: Min/max gene size (important for therapeutics design).
gene_frequency_threshold: Only keep genes with frequency above the set threshold. Frequency ranges from 0-100 where 100 is a gene that occurs 100% of the time in a given phenotype. Include NA if you wish to retain genes that do not have any frequency data. See add_gene_frequency for details.
keep_biotypes: Which gene biotypes to keep (e.g. "protein_coding", "processed_transcript", "snRNA", "lincRNA", "snoRNA", "IG_C_gene").
keep_specificity_quantiles: Which cell type specificity quantiles to keep (max quantile is 40).
keep_mean_exp_quantiles: Which cell type mean expression quantiles to keep (max quantile is 40).
symptom_gene_overlap: The gene for a particular symptom (phenotype + disease) must appear in the celltype-symptom enrichment results.
sort_cols: How to sort the rows using setorderv. names(sort_cols) will be supplied to the cols= argument and values will be supplied to the order= argument.
top_n: Top N genes to keep when grouping by group_vars.
group_vars: Columns to group by when selecting top_n genes.
return_report: If TRUE, will return a named list containing a report that shows the number of phenotypes/celltypes/genes remaining after each filtering step.
verbose: Print messages.
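To make the sort_cols, top_n and group_vars arguments concrete, here is a minimal data.table sketch of sorting by a named vector and then keeping the first N rows per group. It assumes the data.table package is installed. The table contents, the placeholder IDs and the `gene_symbol` column are invented for illustration, and taking `head(.SD, top_n)` per group is only one plausible way to implement the selection, not necessarily what the function does internally.

```r
library(data.table)

# Invented candidate targets for one disease/phenotype pair in two cell types.
dt <- data.table(
  disease_id  = "OMIM:999999",   # placeholder ID
  hpo_id      = "HP:9999999",    # placeholder ID
  CellType    = c("A", "A", "A", "B", "B"),
  gene_symbol = c("G1", "G2", "G3", "G4", "G5"),  # hypothetical column
  q           = c(0.01, 0.01, 0.04, 0.02, 0.02),
  fold_change = c(2.5, 4.0, 3.0, 2.0, 6.0)
)

# Names give the columns; values give the direction
# (1 = ascending, -1 = descending).
sort_cols <- c(q = 1, fold_change = -1)
setorderv(dt, cols = names(sort_cols), order = as.integer(sort_cols))

# Keep the first top_n rows within each group defined by group_vars.
group_vars <- c("disease_id", "hpo_id", "CellType")
top_n <- 1
top <- dt[, head(.SD, top_n), by = group_vars]
print(top)
```

Because the table is sorted before grouping, "top" here means best according to sort_cols: lowest q first, with ties broken by the highest fold change.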
A data.table of the prioritised phenotype- and celltype-specific gene targets.
Term key:
Disease: A disease defined in the OMIM, DECIPHER and/or Orphanet databases.
Phenotype: A clinical feature associated with one or more diseases.
Symptom: A phenotype within the context of a particular disease. Within a given phenotype, there may be multiple symptoms with partially overlapping genetic mechanisms.
Association: A cell type-specific enrichment test result conducted at the disease-level, phenotype-level, or symptom-level.
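The relationship between phenotypes, diseases and symptoms in the term key can be illustrated with a short base R sketch. It assumes a symptom identifier is simply the phenotype and disease IDs joined with a dot, following the hpo_id.disease_id notation used on this page. All IDs below are invented placeholders.

```r
# Invented phenotype-disease annotations: the first phenotype occurs in
# two diseases, the second in one.
annot <- data.frame(
  hpo_id     = c("HP:9999991", "HP:9999991", "HP:9999992"),
  disease_id = c("OMIM:999991", "ORPHA:99992", "OMIM:999991"),
  stringsAsFactors = FALSE
)

# A symptom is a phenotype in the context of one particular disease.
annot$symptom <- paste(annot$hpo_id, annot$disease_id, sep = ".")
print(annot$symptom)

# Number of distinct symptoms nested within each phenotype.
n_per_pheno <- table(annot$hpo_id)
print(n_per_pheno)
```

One phenotype can therefore map to several symptoms, each of which may carry its own, partially overlapping gene list.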
results <- load_example_results()[seq(5000), ]
res <- prioritise_targets(results=results)
#> Prioritising gene targets.
#> Adding term definitions.
#> ℹ All local files already up-to-date!
#> Annotating phenos with Disease
#> Reading cached RDS file: phenotype.hpoa
#> + Version: v2023-10-09
#> Adding disease metadata: Definitions, Preferred.Label
#> Importing Orphanet metadata.
#> Importing OMIM metadata.
#> 1178 / 2469 (47.71%) disease_name missing.
#> 1867 / 2335 (79.96%) Definitions missing.
#> Annotating phenos with MONDO metadata.
#> ℹ All local files already up-to-date!
#> 7 / 2335 (0.3%) MONDO_ID missing.
#> 2037 / 2335 (87.24%) MONDO_name missing.
#> 2289 / 2335 (98.03%) MONDO_definition missing.
#> 1828 / 2335 (78.29%) Definitions missing.
#> Prioritised targets: step='start'
#> - Rows: 5,000
#> - Phenotypes: 3
#> - Diseases: 2,335
#> - Cell types: 14
#> Filtering @ q-value <= 0.05
#> Prioritised targets: step='q_threshold'
#> - Rows: 5,000
#> - Phenotypes: 3
#> - Diseases: 2,335
#> - Cell types: 14
#> Filtering @ fold-change >= 2
#> Prioritised targets: step='fold_threshold'
#> - Rows: 20
#> - Phenotypes: 1
#> - Diseases: 20
#> - Cell types: 1
#> Prioritised targets: step='symptom_p_threshold'
#> - Rows: 20
#> - Phenotypes: 1
#> - Diseases: 20
#> - Cell types: 1
#> Prioritised targets: step='symptom_intersection_size_threshold'
#> - Rows: 20
#> - Phenotypes: 1
#> - Diseases: 20
#> - Cell types: 1
#> Annotating phenos with AgeOfDeath.
#> Translating all phenotypes to HPO IDs.
#> ℹ All local files already up-to-date!
#> + Returning a dictionary of phenotypes (different order as input).
#> Prioritised targets: step='keep_deaths'
#> - Rows: 1
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> Adding level-3 ancestor to each HPO ID.
#> ℹ All local files already up-to-date!
#> Removing remove descendants of: 'Clinical course'
#> Translating all phenotypes to HPO IDs.
#> + Returning a dictionary of phenotypes (different order as input).
#> Prioritised targets: step='remove_descendants'
#> - Rows: 1
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> Getting absolute ontology level for 1 HPO IDs.
#> ℹ All local files already up-to-date!
#> Prioritised targets: step='keep_ont_levels'
#> - Rows: 1
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> Annotating phenos with onset.
#> Translating all phenotypes to HPO IDs.
#> ℹ All local files already up-to-date!
#> + Returning a dictionary of phenotypes (different order as input).
#> Prioritised targets: step='keep_onsets'
#> - Rows: 1
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> Annotating phenos with Tiers.
#> Prioritised targets: step='keep_tiers'
#> - Rows: 1
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> Annotating phenos with modifiers
#> Prioritised targets: step='severity_threshold'
#> - Rows: 1
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> Prioritised targets: step='severity_threshold_max'
#> - Rows: 1
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> Annotating phenos with n_diseases
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2023-10-09
#> Reading cached RDS file: genes_to_phenotype.txt
#> + Version: v2023-10-09
#> Reading cached RDS file: phenotype.hpoa
#> + Version: v2023-10-09
#> Prioritised targets: step='pheno_ndiseases_threshold'
#> - Rows: 1
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> Annotating phenotype frequencies.
#> Prioritised targets: step='pheno_frequency_threshold'
#> - Rows: 1
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> 1 / 1 of cell types kept.
#> Prioritised targets: step='keep_celltypes'
#> - Rows: 1
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> Converting phenos to GRanges.
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2023-10-09
#> Loading required namespace: ensembldb
#> Gathering metadata for 2 unique genes.
#> Loading required namespace: EnsDb.Hsapiens.v75
#> Prioritised targets: step='symptom_gene_overlap'
#> - Rows: 2
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> - Genes: 2
#> Filtering by keep_seqnames.
#> Prioritised targets: step='keep_seqnames'
#> - Rows: 2
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> - Genes: 2
#> Filtering by gene-disease association evidence.
#> Annotating gene-disease associations with Evidence score
#> Gathering data from GenCC.
#> Importing cached file.
#> + Version: 2023-11-14
#> Prioritised targets: step='keep_evidence'
#> - Rows: 2
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> - Genes: 2
#> Filtering by gene size.
#> 2 / 2 genes kept.
#> Prioritised targets: step='gene_size'
#> - Rows: 2
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> - Genes: 2
#> Prioritised targets: step='keep_biotypes'
#> - Rows: 2
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> - Genes: 2
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2023-10-09
#> Prioritised targets: step='keep_specificity_quantiles'
#> - Rows: 2
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> - Genes: 2
#> Prioritised targets: step='keep_mean_exp_quantiles'
#> - Rows: 2
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> - Genes: 2
#> Annotating gene frequencies.
#> Reading cached RDS file: genes_to_phenotype.txt
#> + Version: v2023-10-09
#> Prioritised targets: step='gene_frequency_threshold'
#> - Rows: 2
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> - Genes: 2
#> Sorting rows.
#> Prioritised targets: step='end'
#> - Rows: 2
#> - Phenotypes: 1
#> - Diseases: 1
#> - Cell types: 1
#> - Genes: 2