Prioritise target genes — prioritise

Prioritise target genes using the following multi-level filtering procedure:

Disease-level: keep_deaths: Keep only diseases with a certain age of death.

Disease-level: severity_threshold_max: Keep only diseases annotated as a certain degree of severity or greater (filters on maximum severity per disease).

Phenotype-level: prune_ancestors: Remove redundant ancestral phenotypes when at least one of their descendants already exists.

Phenotype-level: keep_descendants: Keep only phenotypes belonging to a certain branch of the HPO, as defined by an ancestor term.

Phenotype-level: keep_ont_levels: Keep only phenotypes at certain absolute ontology levels within the HPO.
Phenotype-level: pheno_ndiseases_threshold: The maximum number of diseases each phenotype can be associated with.
Phenotype-level: keep_tiers: Keep only phenotypes with high severity Tiers.
Phenotype-level: severity_threshold: Keep only phenotypes with mean Severity equal to or below the threshold.

Phenotype-level: gpt_filters: Keep only phenotypes with certain GPT annotations in specific severity metrics.

Phenotype-level: severity_score_gpt_threshold: Keep only phenotypes with a minimum GPT severity score.

Phenotype-level: info_content_threshold: Keep only phenotypes with a minimum information content score (computed from the HPO).

Symptom-level: pheno_frequency_threshold: Keep only phenotypes with mean frequency equal to or above the threshold (i.e. how frequently a phenotype is associated with any diseases in which it occurs).

Symptom-level: keep_onsets: Keep only symptoms with a certain age of onset.
Symptom-level: symptom_p_threshold: Uncorrected p-value threshold to filter cell type-symptom associations by.

Symptom-level: symptom_intersection_threshold: Minimum proportion of genes overlapping between a symptom gene list (phenotype-associated genes in the context of a particular disease) and the phenotype-cell type association driver genes.

Cell type-level: q_threshold: Keep only cell type-phenotype association results at q<=0.05.

Cell type-level: effect_threshold: Keep only cell type-phenotype association results at effect size>=1.
Cell type-level: keep_celltypes: Keep only terminally differentiated cell types.
Gene-level: keep_chr: Remove genes on non-standard chromosomes.

Gene-level: evidence_score_threshold: Remove genes that are below an aggregate phenotype-gene evidence score threshold.

Gene-level: gene_size: Keep only genes within a minimum/maximum length range (e.g. <4.3kb). The default range (min = 0, max = Inf) keeps all genes.

Gene-level: add_driver_genes: Keep only genes that are driving the association with a given phenotype (inferred by the intersection of phenotype-associated genes and genes with high-specificity quantiles in the target cell type).

Gene-level: keep_biotypes: Keep only genes belonging to certain biotypes.

Gene-level: gene_frequency_threshold: Keep only genes at or above a certain mean frequency threshold (i.e. how frequently a gene is associated with a given phenotype when observed within a disease).

Gene-level: keep_specificity_quantiles: Keep only genes in top specificity quantiles from the cell type dataset (CTD).

Gene-level: keep_mean_exp_quantiles: Keep only genes in top mean expression quantiles from the cell type dataset (CTD).

Gene-level: symptom_gene_overlap: Ensure that genes nominated at the phenotype-level also appear in the genes overlapping at the cell type-specific symptom-level.

All levels: sort_cols: Sort candidate targets by one or more columns (e.g. "severity_score_gpt", "q").

All levels: top_n: Only return the top N targets per variable group (specified with the "group_vars" argument). For example, setting "group_vars" to "hpo_id" and "top_n" to 1 would only return one target (row) per phenotype ID after sorting.
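
The top_n step described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper keep_top_n, the gene_symbol column, and all values are hypothetical and serve only to show how sorting followed by per-group truncation yields one row per phenotype.

```r
# Toy candidate table. Column names follow those mentioned above;
# the IDs, genes and values are made up for illustration.
targets <- data.frame(
  hpo_id = c("HP:toy1", "HP:toy1", "HP:toy2", "HP:toy2", "HP:toy2"),
  gene_symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
  severity_score_gpt = c(30, 30, 25, 25, 25),
  q = c(0.010, 0.001, 0.040, 0.020, 0.030),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the package): sort the table, then
# keep the first `top_n` rows within each group.
keep_top_n <- function(df, group_var, top_n) {
  # Highest severity first, then lowest q-value first.
  df <- df[order(-df$severity_score_gpt, df$q), , drop = FALSE]
  # Position of each row within its group, in the sorted order.
  rank_in_group <- ave(seq_len(nrow(df)), df[[group_var]], FUN = seq_along)
  df[rank_in_group <= top_n, , drop = FALSE]
}

top1 <- keep_top_n(targets, group_var = "hpo_id", top_n = 1)
print(top1[, c("hpo_id", "gene_symbol", "q")])
```

Because the rows are sorted before truncation, the choice of sort columns determines which target survives in each group.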

prioritise_targets(
  results = load_example_results(),
  ctd_list = load_example_ctd(c("ctd_DescartesHuman.rds", "ctd_HumanCellLandscape.rds"),
    multi_dataset = TRUE),
  phenotype_to_genes = HPOExplorer::load_phenotype_to_genes(),
  hpo = HPOExplorer::get_hpo(),
  keep_deaths = HPOExplorer::list_deaths(exclude = c("Miscarriage", "Stillbirth",
    "Prenatal death"), include_na = TRUE),
  keep_descendants = c("Phenotypic abnormality"),
  keep_ont_levels = NULL,
  pheno_ndiseases_threshold = NULL,
  gpt_filters = NULL,
  severity_score_gpt_threshold = 20,
  keep_tiers = NULL,
  severity_threshold_max = NULL,
  info_content_threshold = 8,
  run_prune_ancestors = TRUE,
  severity_threshold = NULL,
  pheno_frequency_threshold = NULL,
  keep_onsets = HPOExplorer::list_onsets(include_na = TRUE),
  effect_var = "logFC",
  q_threshold = 0.05,
  effect_threshold = 1,
  symptom_intersection_threshold = 0.25,
  keep_celltypes = NULL,
  evidence_score_threshold = 15,
  keep_chr = c(seq(22), "X", "Y"),
  gene_size = list(min = 0, max = Inf),
  gene_frequency_threshold = NULL,
  keep_biotypes = NULL,
  keep_specificity_quantiles = seq(30, 40),
  keep_mean_exp_quantiles = seq(30, 40),
  sort_cols = c(severity_score_gpt = -1, q = 1, logFC = -1, specificity = -1, mean_exp =
    -1, pheno_freq_mean = -1, gene_freq_mean = -1, width = 1),
  top_n = NULL,
  group_vars = c("hpo_id"),
  return_report = TRUE,
  verbose = TRUE
)

Arguments

results: The cell type-phenotype enrichment results generated by gen_results and merged together with merge_results.
ctd_list: A named list of CellTypeDataset objects each created with generate_celltype_data.
phenotype_to_genes: Output of load_phenotype_to_genes mapping phenotypes to gene annotations.
hpo: Human Phenotype Ontology object, loaded from get_ontology.
keep_deaths: The age of death associated with each HPO ID to keep. If >1 age of death is associated with the term, only the earliest age is considered. See add_death for details.
keep_descendants: Terms whose descendants should be kept (including themselves). Set to NULL to skip this filtering step.
keep_ont_levels: Only keep phenotypes at certain absolute ontology levels. See add_ont_lvl for details.
pheno_ndiseases_threshold: Filter phenotypes by the maximum number of diseases they are associated with.
gpt_filters: A named list of filters to apply to the GPT annotations.
severity_score_gpt_threshold: The minimum GPT severity score that a phenotype can have across any disease.
keep_tiers: Tiers from hpo_tiers to keep. Include NA if you wish to retain phenotypes that do not have any Tier assignment.
severity_threshold_max: The max severity score that a phenotype can have across any disease.
info_content_threshold: Minimum phenotype information content threshold.
run_prune_ancestors: Prune redundant ancestral terms if any of their descendants are present. Passes to prune_ancestors.
severity_threshold: Only keep phenotypes with a mean severity score (averaged across multiple associated diseases) below the set threshold. The severity score ranges from 1-4 where 1 is the MOST severe. Include NA if you wish to retain phenotypes that do not have any severity score.
pheno_frequency_threshold: Only keep phenotypes with frequency above the set threshold. Frequency ranges from 0-100 where 100 is a phenotype that occurs 100% of the time in all associated diseases. Include NA if you wish to retain phenotypes that do not have any frequency data. See add_pheno_frequency for details.
keep_onsets: The age of onset associated with each HPO ID to keep. If >1 age of onset is associated with the term, only the earliest age is considered. See add_onset for details.
effect_var: Name of the effect size column in the results.
q_threshold: The q value threshold to subset the results by.
effect_threshold: The minimum fold change in specific expression to subset the results by.
symptom_intersection_threshold: Minimum proportion of genes overlapping between a symptom gene list (phenotype-associated genes in the context of a particular disease) and the phenotype-cell type association driver genes.
keep_celltypes: Cell types to keep.
evidence_score_threshold: The minimum threshold of mean evidence scores of each gene-phenotype association to keep.
keep_chr: Chromosomes to keep.
gene_size: Min/max gene size (important for therapeutics design).
gene_frequency_threshold: Only keep genes with frequency above the set threshold. Frequency ranges from 0-100 where 100 is a gene that occurs 100% of the time in a given phenotype. Include NA if you wish to retain genes that do not have any frequency data. See add_gene_frequency for details.
keep_biotypes: Which gene biotypes to keep. (e.g. "protein_coding", "processed_transcript", "snRNA", "lincRNA", "snoRNA", "IG_C_gene")
keep_specificity_quantiles: Which cell type specificity quantiles to keep (max quantile is 40).
keep_mean_exp_quantiles: Which cell type mean expression quantiles to keep (max quantile is 40).
sort_cols: How to sort the rows using setorderv. names(sort_cols) will be supplied to the cols= argument and values will be supplied to the order= argument.
top_n: Top N genes to keep when grouping by group_vars.
group_vars: Columns to group by when selecting top_n genes.
return_report: If TRUE, will return a named list containing a report that shows the number of phenotypes/celltypes/genes remaining after each filtering step.
verbose: Print messages.
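
As a hedged illustration of the sort_cols description above, the following sketch passes a named vector to data.table::setorderv in the way the argument text describes. The table contents and the gene_symbol column are made up, and the step that drops names not present in the table is an assumption rather than documented behaviour.

```r
library(data.table)

# Toy table with made-up values.
targets <- data.table(
  gene_symbol = c("GENE_A", "GENE_B", "GENE_C"),
  severity_score_gpt = c(20, 35, 35),
  q = c(0.001, 0.030, 0.002)
)

# Same shape as the `sort_cols` argument: names are column names,
# values are sort directions (1 = ascending, -1 = descending).
sort_cols <- c(severity_score_gpt = -1, q = 1)

# Assumption: names that are not columns of the table are dropped first.
sort_cols <- sort_cols[names(sort_cols) %in% names(targets)]

# names(sort_cols) go to `cols`, the values go to `order`.
setorderv(targets, cols = names(sort_cols), order = sort_cols)
print(targets)
```

Here the two rows with the highest severity score come first, and the tie between them is broken by the smaller q-value.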

Value

A data.table of the prioritised phenotype- and cell type-specific gene targets.

Details

Term key:

Disease: A disease defined in the OMIM, DECIPHER and/or Orphanet databases.

Phenotype: A clinical feature associated with one or more diseases.

Symptom: A phenotype within the context of a particular disease. Within a given phenotype, there may be multiple symptoms with partially overlapping genetic mechanisms.

Association: A cell type-specific enrichment test result conducted at the disease-level, phenotype-level, or symptom-level.
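
To make the distinction between a phenotype and its symptoms concrete, the sketch below splits one phenotype's genes by disease and computes how much each symptom gene list overlaps a set of driver genes, in the spirit of symptom_intersection_threshold. All names and values are placeholders, and taking the proportion relative to the symptom gene list is an assumption; the package may use a different denominator.

```r
# Genes annotated to one phenotype, split by the disease in which the
# phenotype is observed. Each phenotype-disease pair is one "symptom".
symptom_genes <- list(
  disease_1 = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  disease_2 = c("GENE_C", "GENE_D", "GENE_E", "GENE_F",
                "GENE_G", "GENE_H", "GENE_I", "GENE_J")
)

# Driver genes of the phenotype-cell type association (placeholders).
driver_genes <- c("GENE_A", "GENE_B", "GENE_C")

# Assumption: the proportion is computed relative to the symptom gene list.
overlap_prop <- vapply(
  symptom_genes,
  function(genes) length(intersect(genes, driver_genes)) / length(genes),
  numeric(1)
)

# Keep symptoms whose overlap meets the threshold.
threshold <- 0.25
kept <- names(overlap_prop)[overlap_prop >= threshold]
print(overlap_prop)
print(kept)
```

The two symptoms share genes (GENE_C and GENE_D) yet differ in how strongly they overlap the driver genes, which is why the filter is applied per symptom rather than per phenotype.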

Examples

results <- load_example_results()[q<0.05]
out <- prioritise_targets(results=results)
#> Loading ctd_DescartesHuman.rds
#> Loading ctd_HumanCellLandscape.rds
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> Prioritising gene targets.
#> Adding logFC column.
#> Adding HPO names.
#> Translating ontology terms to names.
#> Adding term definitions.
#> Adding information_content scores.
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> Adding genes and disease IDs.
#> Prioritised targets: step='start' 
#>  - Rows: 10,938,355 
#>  - Phenotypes: 9,575 
#>  - Diseases: 8,805 
#>  - Cell types: 201 
#>  - Genes: 5,092
#> Filtering @ q-value <= 0.05
#> Prioritised targets: step='q_threshold' 
#>  - Rows: 10,938,355 
#>  - Phenotypes: 9,575 
#>  - Diseases: 8,805 
#>  - Cell types: 201 
#>  - Genes: 5,092
#> Filtering @ logFC >= 1
#> Prioritised targets: step='effect_threshold' 
#>  - Rows: 16,235 
#>  - Phenotypes: 637 
#>  - Diseases: 2,664 
#>  - Cell types: 130 
#>  - Genes: 2,364
#> Annotating phenos with AgeOfDeath.
#> Translating ontology terms to ids.
#> Prioritised targets: step='keep_deaths' 
#>  - Rows: 16,209 
#>  - Phenotypes: 633 
#>  - Diseases: 2,657 
#>  - Cell types: 130 
#>  - Genes: 2,362
#> Adding level-2 ancestor to each HPO ID.
#> Adding ancestor metadata.
#> Ancestor metadata already present. Use force_new=TRUE to overwrite.
#> Translating ontology terms to ids.
#> Keeping descendants of 1 term(s).
#> 18,379 terms remain after filtering.
#> 16,198 associations remain after filtering.
#> Prioritised targets: step='keep_descendants' 
#>  - Rows: 16,198 
#>  - Phenotypes: 627 
#>  - Diseases: 2,653 
#>  - Cell types: 130 
#>  - Genes: 2,362
#> Getting absolute ontology level for 19,025 IDs.
#> Prioritised targets: step='keep_ont_levels' 
#>  - Rows: 16,198 
#>  - Phenotypes: 627 
#>  - Diseases: 2,653 
#>  - Cell types: 130 
#>  - Genes: 2,362
#> Translating ontology terms to ids.
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> 383 phenotypes do not have matching HPO IDs.
#> Reading in GPT annotations for 16,753 phenotypes.
#> Prioritised targets: step='gpt_filters' 
#>  - Rows: 16,198 
#>  - Phenotypes: 627 
#>  - Diseases: 2,653 
#>  - Cell types: 130 
#>  - Genes: 2,362
#> Prioritised targets: step='severity_score_gpt_threshold' 
#>  - Rows: 5,157 
#>  - Phenotypes: 258 
#>  - Diseases: 1,459 
#>  - Cell types: 89 
#>  - Genes: 1,457
#> Prioritised targets: step='info_content_threshold' 
#>  - Rows: 3,605 
#>  - Phenotypes: 219 
#>  - Diseases: 1,040 
#>  - Cell types: 88 
#>  - Genes: 1,172
#> Annotating phenos with onset.
#> Translating ontology terms to ids.
#> Prioritised targets: step='keep_onsets' 
#>  - Rows: 3,605 
#>  - Phenotypes: 219 
#>  - Diseases: 1,040 
#>  - Cell types: 88 
#>  - Genes: 1,172
#> Annotating phenos with Tiers.
#> Prioritised targets: step='keep_tiers' 
#>  - Rows: 3,605 
#>  - Phenotypes: 219 
#>  - Diseases: 1,040 
#>  - Cell types: 88 
#>  - Genes: 1,172
#> Annotating phenos with modifiers
#> Prioritised targets: step='severity_threshold' 
#>  - Rows: 3,605 
#>  - Phenotypes: 219 
#>  - Diseases: 1,040 
#>  - Cell types: 88 
#>  - Genes: 1,172
#> Prioritised targets: step='severity_threshold_max' 
#>  - Rows: 3,605 
#>  - Phenotypes: 219 
#>  - Diseases: 1,040 
#>  - Cell types: 88 
#>  - Genes: 1,172
#> Annotating phenos with n_diseases
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> Reading cached RDS file: genes_to_phenotype.txt
#> + Version: v2024-12-12
#> Reading cached RDS file: phenotype.hpoa
#> + Version: v2024-12-12
#> Prioritised targets: step='pheno_ndiseases_threshold' 
#>  - Rows: 3,605 
#>  - Phenotypes: 219 
#>  - Diseases: 1,040 
#>  - Cell types: 88 
#>  - Genes: 1,172
#> Annotating phenotype frequencies.
#> Prioritised targets: step='pheno_frequency_threshold' 
#>  - Rows: 3,614 
#>  - Phenotypes: 219 
#>  - Diseases: 1,040 
#>  - Cell types: 88 
#>  - Genes: 1,172
#> Prioritised targets: step='keep_celltypes' 
#>  - Rows: 3,614 
#>  - Phenotypes: 219 
#>  - Diseases: 1,040 
#>  - Cell types: 88 
#>  - Genes: 1,172
#> Converting phenos to GRanges.
#> Loading required namespace: ensembldb
#> Gathering metadata for 1172 unique genes.
#> Loading required namespace: EnsDb.Hsapiens.v75
#> Prioritised targets: step='symptom_gene_overlap' 
#>  - Rows: 3,235 
#>  - Phenotypes: 219 
#>  - Diseases: 997 
#>  - Cell types: 88 
#>  - Genes: 1,097
#> Filtering by keep_chr.
#> Prioritised targets: step='keep_chr' 
#>  - Rows: 3,235 
#>  - Phenotypes: 219 
#>  - Diseases: 997 
#>  - Cell types: 88 
#>  - Genes: 1,097
#> Filtering by gene-disease association evidence.
#> Annotating gene-disease associations with Evidence Score
#> Gathering data from GenCC.
#> Importing cached file.
#> Evidence scores for: 
#>  - 10514 diseases 
#>  - 5171 genes
#> + Version: 2024-12-19
#> Prioritised targets: step='evidence_score_threshold' 
#>  - Rows: 288 
#>  - Phenotypes: 70 
#>  - Diseases: 138 
#>  - Cell types: 36 
#>  - Genes: 139
#> Filtering by gene size.
#> 139 / 139 genes kept.
#> Prioritised targets: step='gene_size' 
#>  - Rows: 288 
#>  - Phenotypes: 70 
#>  - Diseases: 138 
#>  - Cell types: 36 
#>  - Genes: 139
#> Prioritised targets: step='keep_biotypes' 
#>  - Rows: 288 
#>  - Phenotypes: 70 
#>  - Diseases: 138 
#>  - Cell types: 36 
#>  - Genes: 139
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> Prioritised targets: step='add_driver_genes' 
#>  - Rows: 202 
#>  - Phenotypes: 65 
#>  - Diseases: 99 
#>  - Cell types: 33 
#>  - Genes: 97
#> Adding symptom-level results.
#> Subsetting results by q_threshold and effect.
#> 202 associations remain after filtering.
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2024-12-12
#> Dropping subthreshold symptoms.
#> Prioritised targets: step='symptom_intersection_threshold' 
#>  - Rows: 202 
#>  - Phenotypes: 65 
#>  - Diseases: 99 
#>  - Cell types: 33 
#>  - Genes: 97
#> Annotating gene frequencies.
#> Reading cached RDS file: genes_to_phenotype.txt
#> + Version: v2024-12-12
#> Prioritised targets: step='gene_frequency_threshold' 
#>  - Rows: 246 
#>  - Phenotypes: 65 
#>  - Diseases: 99 
#>  - Cell types: 33 
#>  - Genes: 97
#> Pruning ancestors.
#> 58 / 65 terms were kept after pruning.
#> Prioritised targets: step='prune_ancestors' 
#>  - Rows: 215 
#>  - Phenotypes: 58 
#>  - Diseases: 99 
#>  - Cell types: 33 
#>  - Genes: 97
#> Sorting rows.
#> Prioritised targets: step='end' 
#>  - Rows: 215 
#>  - Phenotypes: 58 
#>  - Diseases: 99 
#>  - Cell types: 33 
#>  - Genes: 97
#> Adding disease_name and disease_description.
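
The log above reports what remains after each filtering step. In this run the largest reduction in rows after the initial effect_threshold step occurs at evidence_score_threshold, from 3,235 to 288. The pattern of applying filters in sequence while recording counts can be sketched in base R as follows. This is a toy illustration under assumed column names and thresholds, not the package's internal reporting code.

```r
# Toy results table with made-up values.
res <- data.frame(
  hpo_id = c("HP:toy1", "HP:toy1", "HP:toy2", "HP:toy3"),
  q = c(0.01, 0.20, 0.03, 0.04),
  logFC = c(2.0, 3.0, 0.5, 1.5),
  stringsAsFactors = FALSE
)

# Each step is a function that takes a table and returns a filtered table.
steps <- list(
  q_threshold = function(d) d[d$q <= 0.05, , drop = FALSE],
  effect_threshold = function(d) d[d$logFC >= 1, , drop = FALSE]
)

# Hypothetical helper: summarise what is left after a step.
count_step <- function(d, step) {
  data.frame(step = step,
             rows = nrow(d),
             phenotypes = length(unique(d$hpo_id)),
             stringsAsFactors = FALSE)
}

# Record the starting counts, then the counts after every filter.
report <- count_step(res, "start")
for (nm in names(steps)) {
  res <- steps[[nm]](res)
  report <- rbind(report, count_step(res, nm))
}
print(report)
```

Comparing consecutive rows of such a report shows which step is the most restrictive for a given dataset.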