Prioritise target genes using the following multi-step filtering procedure:

  1. Disease-level: keep_deaths: Keep only diseases with a certain age of death.

  2. Phenotype-level: remove_descendants: Remove phenotypes belonging to a certain branch of the HPO, as defined by an ancestor term.

  3. Phenotype-level: keep_ont_levels: Keep only phenotypes at certain absolute ontology levels within the HPO.

  4. Phenotype-level: pheno_ndiseases_threshold: The maximum number of diseases each phenotype can be associated with.

  5. Phenotype-level: keep_tiers: Keep only phenotypes with high severity Tiers.

  6. Phenotype-level: severity_threshold: Keep only phenotypes with mean Severity equal to or below the threshold.

  7. Symptom-level: pheno_frequency_threshold: Keep only phenotypes with mean frequency equal to or above the threshold (i.e. how frequently a phenotype is associated with any diseases in which it occurs).

  8. Symptom-level: keep_onsets: Keep only symptoms with a certain age of onset.

  9. Symptom-level: symptom_p_threshold: Uncorrected p-value threshold to filter cell type-symptom associations by.

  10. Symptom-level: symptom_intersection_size_threshold: Minimum number of genes overlapping between a symptom gene list (phenotype-associated genes in the context of a particular disease) and a cell type gene list (genes in the top 10/40 specificity quantiles).

  11. Cell type-level: q_threshold: Keep only cell type-phenotype association results at q<=0.05.

  12. Cell type-level: fold_threshold: Keep only cell type-phenotype association results at fold_change>=2 (the default fold_threshold).

  13. Cell type-level: keep_celltypes: Keep only terminally differentiated cell types.

  14. Gene-level: keep_seqnames: Remove genes on non-standard chromosomes.

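  15. Gene-level: gene_size: Keep only genes within a given size range (e.g. <4.3kb in length). The default range (min = 0, max = Inf) applies no size restriction.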

  16. Gene-level: keep_biotypes: Keep only genes belonging to certain biotypes.

  17. Gene-level: gene_frequency_threshold: Keep only genes at or above a certain mean frequency threshold (i.e. how frequently a gene is associated with a given phenotype when observed within a disease).

  18. Gene-level: keep_specificity_quantiles: Keep only genes in top specificity quantiles from the cell type dataset (ctd).

  19. Gene-level: keep_mean_exp_quantiles: Keep only genes in top mean expression quantiles from the cell type dataset (ctd).

  20. Gene-level: symptom_gene_overlap: Ensure that genes nominated at the phenotype-level also appear in the genes overlapping at the cell type-specific symptom-level.

  21. All levels: top_n: Sort candidate targets by a preferred order of metrics and only return the top N targets per cell type-phenotype combination.

prioritise_targets(
  results = load_example_results(),
  ctd = load_example_ctd(),
  annotLevel = 1,
  phenotype_to_genes = HPOExplorer::load_phenotype_to_genes(),
  hpo = HPOExplorer::get_hpo(),
  keep_deaths = HPOExplorer::list_deaths(exclude = c("Miscarriage", "Stillbirth",
    "Prenatal death"), include_na = FALSE),
  remove_descendants = c("Clinical course"),
  keep_ont_levels = NULL,
  pheno_ndiseases_threshold = NULL,
  keep_tiers = c(1, 2, NA),
  severity_threshold_max = NULL,
  severity_threshold = c(2, NA),
  pheno_frequency_threshold = NULL,
  keep_onsets = HPOExplorer::list_onsets(exclude = c("Antenatal", "Fetal", "Congenital"),
    include_na = TRUE),
  q_threshold = 0.05,
  fold_threshold = 2,
  symptom_p_threshold = NULL,
  symptom_intersection_size_threshold = 1,
  keep_celltypes = terminal_celltypes()$CellType,
  keep_evidence = seq(3, 6),
  keep_seqnames = c(seq(22), "X", "Y"),
  gene_size = list(min = 0, max = Inf),
  gene_frequency_threshold = NULL,
  keep_biotypes = NULL,
  keep_specificity_quantiles = NULL,
  keep_mean_exp_quantiles = seq(1, 40),
  symptom_gene_overlap = TRUE,
  sort_cols = c(tier_merge = 1, Severity_score_mean = 1, q = 1, fold_change = -1,
    specificity = -1, mean_exp = -1, pheno_freq_mean = -1, gene_freq_mean = -1, width =
    1),
  top_n = NULL,
  group_vars = c("disease_id", "hpo_id", "CellType"),
  return_report = TRUE,
  verbose = TRUE
)

Arguments

results

The cell type-phenotype enrichment results generated by gen_results and merged together with merge_results.

ctd

Cell Type Data List generated using generate_celltype_data.

annotLevel

An integer indicating which level of sct_data to analyse (Default: 1).

phenotype_to_genes

Output of load_phenotype_to_genes mapping phenotypes to gene annotations.

hpo

Human Phenotype Ontology object, loaded from ontologyIndex.

keep_deaths

The age of death associated with each HPO ID to keep. If >1 age of death is associated with the term, only the earliest age is considered. See add_death for details.

remove_descendants

Remove HPO terms that are descendants of a given ancestral HPO term. Ancestral terms can be provided as a character vector of phenotype names (e.g. c("Clinical course")), HPO IDs (e.g. "HP:0031797") or a mixture of the two. See add_ancestor for details.

keep_ont_levels

Only keep phenotypes at certain absolute ontology levels. See add_ont_lvl for details.

pheno_ndiseases_threshold

Filter phenotypes by the maximum number of diseases they are associated with.

keep_tiers

Tiers from hpo_tiers to keep. Include NA if you wish to retain phenotypes that do not have any Tier assignment.
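
The role of NA in keep_tiers can be illustrated with a minimal sketch. It assumes the filter behaves like a membership test with %in% (an assumption, since the internal implementation is not shown here), and the tier vector is hypothetical.

```r
# Hypothetical Tier assignments for five phenotypes (NA = no Tier assigned).
tier <- c(1, 3, NA, 2, 4)

# Assumed filter logic: keep entries whose Tier is listed in keep_tiers.
keep_with_na <- tier %in% c(1, 2, NA)  # NA matches NA under %in%
keep_without_na <- tier %in% c(1, 2)   # phenotypes without a Tier are dropped

which(keep_with_na)     # positions retained when NA is included
which(keep_without_na)  # positions retained when NA is omitted
```

Under this assumption, omitting NA silently removes every phenotype that lacks a Tier assignment.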

severity_threshold_max

The max severity score that a phenotype can have across any disease.

severity_threshold

Only keep phenotypes with a mean severity score (averaged across multiple associated diseases) at or below the set threshold. The severity score ranges from 1 to 4, where 1 is the MOST severe. Include NA if you wish to retain phenotypes that do not have any severity score.

pheno_frequency_threshold

Only keep phenotypes with frequency at or above the set threshold. Frequency ranges from 0 to 100, where 100 is a phenotype that occurs 100% of the time in all associated diseases. Include NA if you wish to retain phenotypes that do not have any frequency data. See add_pheno_frequency for details.

keep_onsets

The age of onset associated with each HPO ID to keep. If >1 age of onset is associated with the term, only the earliest age is considered. See add_onset for details.

q_threshold

The q value threshold to subset the results by.

fold_threshold

The minimum fold change in specific expression to subset the results by.
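
As a rough illustration of how q_threshold and fold_threshold act together, the sketch below subsets a hypothetical results table in base R. The column names q and fold_change follow those used in sort_cols; the data values and the combined logical filter are assumptions, not the package's internal code.

```r
# Hypothetical cell type-phenotype association results.
res <- data.frame(
  CellType = c("A", "B", "C", "D"),
  q = c(0.01, 0.20, 0.04, 0.05),
  fold_change = c(3.0, 5.0, 1.5, 2.0),
  stringsAsFactors = FALSE
)

q_threshold <- 0.05
fold_threshold <- 2

# Keep rows that pass both thresholds (both bounds inclusive).
kept <- res[res$q <= q_threshold & res$fold_change >= fold_threshold, ]
kept$CellType
```

Row B fails the q-value filter and row C fails the fold-change filter, so only A and D remain.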

symptom_p_threshold

The p-value threshold of cell type-symptom enrichment results (using gen_overlap). Here, "symptoms" are defined as the presentation of a phenotype in the context of a particular disease. In other words: phenotype (hpo_id) + disease (disease_id) = symptom (hpo_id.disease_id).
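
The phenotype + disease = symptom relationship can be sketched as a simple identifier join. The IDs below are placeholders rather than real HPO or disease identifiers, and building the key with paste() is an assumption about how such an ID might be formed.

```r
# Placeholder phenotype and disease identifiers (not real HPO/OMIM/Orphanet IDs).
hpo_id <- c("HP:A", "HP:A", "HP:B")
disease_id <- c("OMIM:1", "ORPHA:2", "OMIM:1")

# One symptom per phenotype-disease pair, written as hpo_id.disease_id.
symptom_id <- paste(hpo_id, disease_id, sep = ".")

n_phenotypes <- length(unique(hpo_id))
n_symptoms <- length(unique(symptom_id))
```

Here two phenotypes give rise to three symptoms, because the same phenotype appears in two different diseases.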

symptom_intersection_size_threshold

The minimum number of intersecting genes between a symptom and a celltype to consider it a significant enrichment. Refers to the result from gen_overlap.
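
A minimal sketch of the intersection size check, using hypothetical gene symbols. It does not call gen_overlap; it only shows the set operation the threshold is described as applying to.

```r
# Hypothetical gene lists.
symptom_genes <- c("GENE1", "GENE2", "GENE3")            # phenotype genes within one disease
celltype_genes <- c("GENE2", "GENE3", "GENE4", "GENE5")  # top-quantile genes for one cell type

symptom_intersection_size_threshold <- 1

# Genes shared by the symptom and the cell type.
intersection <- intersect(symptom_genes, celltype_genes)
intersection_size <- length(intersection)

# The symptom-cell type pair is kept if enough genes overlap.
passes <- intersection_size >= symptom_intersection_size_threshold
```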

keep_celltypes

Cell types to keep.

keep_evidence

The evidence scores of each gene-disease association to keep.

keep_seqnames

Chromosomes to keep.

gene_size

Min/max gene size (important for therapeutics design).
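
The list form of gene_size can be illustrated with a small base R sketch. The gene table is hypothetical, widths are assumed to be in base pairs, and treating both bounds as inclusive is an assumption about the filter.

```r
# Hypothetical genes with their genomic widths (assumed to be in base pairs).
genes <- data.frame(
  gene = c("G1", "G2", "G3"),
  width = c(1200, 4000, 250000),
  stringsAsFactors = FALSE
)

# Same structure as the gene_size argument, here restricting to genes up to 4.3kb.
gene_size <- list(min = 0, max = 4300)

keep <- genes$width >= gene_size$min & genes$width <= gene_size$max
genes$gene[keep]
```

With the default list(min = 0, max = Inf), every gene in this table would be kept.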

gene_frequency_threshold

Only keep genes with frequency at or above the set threshold. Frequency ranges from 0 to 100, where 100 is a gene that occurs 100% of the time in a given phenotype. Include NA if you wish to retain genes that do not have any frequency data. See add_gene_frequency for details.

keep_biotypes

Which gene biotypes to keep (e.g. "protein_coding", "processed_transcript", "snRNA", "lincRNA", "snoRNA", "IG_C_gene").

keep_specificity_quantiles

Which cell type specificity quantiles to keep (max quantile is 40).

keep_mean_exp_quantiles

Which cell type mean expression quantiles to keep (max quantile is 40).

symptom_gene_overlap

The gene for a particular symptom (phenotype + disease) must appear in the celltype-symptom enrichment results.

sort_cols

How to sort the rows using setorderv. names(sort_cols) will be supplied to the cols= argument and values will be supplied to the order= argument.
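
The mapping from sort_cols to setorderv can be sketched as follows. The table and its values are hypothetical; only the use of names(sort_cols) for cols= and the values for order= follows the description above.

```r
library(data.table)

# Hypothetical candidate targets.
dt <- data.table(
  gene = c("G1", "G2", "G3", "G4"),
  q = c(0.01, 0.01, 0.03, 0.02),
  fold_change = c(2.5, 4.0, 3.0, 6.0)
)

# 1 = ascending, -1 = descending.
sort_cols <- c(q = 1, fold_change = -1)

# Sort by q (ascending), breaking ties by fold_change (descending).
setorderv(dt, cols = names(sort_cols), order = as.integer(sort_cols))
dt$gene
```

G1 and G2 share the lowest q-value, so the descending fold_change tie-break places G2 first.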

top_n

Top N genes to keep when grouping by group_vars.

group_vars

Columns to group by when selecting top_n genes.
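
How top_n and group_vars might interact can be sketched with data.table. The table is hypothetical, and taking the first top_n rows of each group after sorting is an assumed implementation, not necessarily the one used internally.

```r
library(data.table)

# Hypothetical candidate targets across two phenotype-cell type groups.
dt <- data.table(
  hpo_id = c("HP:A", "HP:A", "HP:A", "HP:B", "HP:B"),
  CellType = c("c1", "c1", "c1", "c2", "c2"),
  gene = c("G1", "G2", "G3", "G4", "G5"),
  q = c(0.03, 0.01, 0.02, 0.04, 0.015)
)

group_vars <- c("hpo_id", "CellType")
top_n <- 2

# Sort first so that the best candidates come first within each group.
setorderv(dt, cols = "q", order = 1L)

# Keep the first top_n rows of each group.
top <- dt[, head(.SD, top_n), by = group_vars]
top$gene
```

G1 is dropped because it ranks third within its group, while the second group has only two candidates and keeps both.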

return_report

If TRUE, will return a named list containing a report that shows the number of phenotypes/celltypes/genes remaining after each filtering step.

verbose

Print messages.

Value

A data.table of the prioritised phenotype- and celltype-specific gene targets.

Details

Term key:

  • Disease: A disease defined in the OMIM, DECIPHER and/or Orphanet databases.

  • Phenotype: A clinical feature associated with one or more diseases.

  • Symptom: A phenotype within the context of a particular disease. Within a given phenotype, there may be multiple symptoms with partially overlapping genetic mechanisms.

  • Association: A cell type-specific enrichment test result conducted at the disease-level, phenotype-level, or symptom-level.

Examples

results <- load_example_results()[seq(5000), ]
res <- prioritise_targets(results=results)
#> Prioritising gene targets.
#> Adding term definitions.
#>  All local files already up-to-date!
#> Annotating phenos with Disease
#> Reading cached RDS file: phenotype.hpoa
#> + Version: v2023-10-09
#> Adding disease metadata: Definitions, Preferred.Label
#> Importing Orphanet metadata.
#> Importing OMIM metadata.
#> 1178 / 2469 (47.71%) disease_name missing.
#> 1867 / 2335 (79.96%) Definitions missing.
#> Annotating phenos with MONDO metadata.
#>  All local files already up-to-date!
#> 7 / 2335 (0.3%) MONDO_ID missing.
#> 2037 / 2335 (87.24%) MONDO_name missing.
#> 2289 / 2335 (98.03%) MONDO_definition missing.
#> 1828 / 2335 (78.29%) Definitions missing.
#> Prioritised targets: step='start' 
#>  - Rows: 5,000 
#>  - Phenotypes: 3 
#>  - Diseases: 2,335 
#>  - Cell types: 14
#> Filtering @ q-value <= 0.05
#> Prioritised targets: step='q_threshold' 
#>  - Rows: 5,000 
#>  - Phenotypes: 3 
#>  - Diseases: 2,335 
#>  - Cell types: 14
#> Filtering @ fold-change >= 2
#> Prioritised targets: step='fold_threshold' 
#>  - Rows: 20 
#>  - Phenotypes: 1 
#>  - Diseases: 20 
#>  - Cell types: 1
#> Prioritised targets: step='symptom_p_threshold' 
#>  - Rows: 20 
#>  - Phenotypes: 1 
#>  - Diseases: 20 
#>  - Cell types: 1
#> Prioritised targets: step='symptom_intersection_size_threshold' 
#>  - Rows: 20 
#>  - Phenotypes: 1 
#>  - Diseases: 20 
#>  - Cell types: 1
#> Annotating phenos with AgeOfDeath.
#> Translating all phenotypes to HPO IDs.
#>  All local files already up-to-date!
#> + Returning a dictionary of phenotypes (different order as input).
#> Prioritised targets: step='keep_deaths' 
#>  - Rows: 1 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1
#> Adding level-3 ancestor to each HPO ID.
#>  All local files already up-to-date!
#> Removing remove descendants of: 'Clinical course'
#> Translating all phenotypes to HPO IDs.
#> + Returning a dictionary of phenotypes (different order as input).
#> Prioritised targets: step='remove_descendants' 
#>  - Rows: 1 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1
#> Getting absolute ontology level for 1 HPO IDs.
#>  All local files already up-to-date!
#> Prioritised targets: step='keep_ont_levels' 
#>  - Rows: 1 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1
#> Annotating phenos with onset.
#> Translating all phenotypes to HPO IDs.
#>  All local files already up-to-date!
#> + Returning a dictionary of phenotypes (different order as input).
#> Prioritised targets: step='keep_onsets' 
#>  - Rows: 1 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1
#> Annotating phenos with Tiers.
#> Prioritised targets: step='keep_tiers' 
#>  - Rows: 1 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1
#> Annotating phenos with modifiers
#> Prioritised targets: step='severity_threshold' 
#>  - Rows: 1 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1
#> Prioritised targets: step='severity_threshold_max' 
#>  - Rows: 1 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1
#> Annotating phenos with n_diseases
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2023-10-09
#> Reading cached RDS file: genes_to_phenotype.txt
#> + Version: v2023-10-09
#> Reading cached RDS file: phenotype.hpoa
#> + Version: v2023-10-09
#> Prioritised targets: step='pheno_ndiseases_threshold' 
#>  - Rows: 1 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1
#> Annotating phenotype frequencies.
#> Prioritised targets: step='pheno_frequency_threshold' 
#>  - Rows: 1 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1
#> 1 / 1 of cell types kept.
#> Prioritised targets: step='keep_celltypes' 
#>  - Rows: 1 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1
#> Converting phenos to GRanges.
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2023-10-09
#> Loading required namespace: ensembldb
#> Gathering metadata for 2 unique genes.
#> Loading required namespace: EnsDb.Hsapiens.v75
#> Prioritised targets: step='symptom_gene_overlap' 
#>  - Rows: 2 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1 
#>  - Genes: 2
#> Filtering by keep_seqnames.
#> Prioritised targets: step='keep_seqnames' 
#>  - Rows: 2 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1 
#>  - Genes: 2
#> Filtering by gene-disease association evidence.
#> Annotating gene-disease associations with Evidence score
#> Gathering data from GenCC.
#> Importing cached file.
#> + Version: 2023-11-14
#> Prioritised targets: step='keep_evidence' 
#>  - Rows: 2 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1 
#>  - Genes: 2
#> Filtering by gene size.
#> 2 / 2 genes kept.
#> Prioritised targets: step='gene_size' 
#>  - Rows: 2 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1 
#>  - Genes: 2
#> Prioritised targets: step='keep_biotypes' 
#>  - Rows: 2 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1 
#>  - Genes: 2
#> Reading cached RDS file: phenotype_to_genes.txt
#> + Version: v2023-10-09
#> Prioritised targets: step='keep_specificity_quantiles' 
#>  - Rows: 2 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1 
#>  - Genes: 2
#> Prioritised targets: step='keep_mean_exp_quantiles' 
#>  - Rows: 2 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1 
#>  - Genes: 2
#> Annotating gene frequencies.
#> Reading cached RDS file: genes_to_phenotype.txt
#> + Version: v2023-10-09
#> Prioritised targets: step='gene_frequency_threshold' 
#>  - Rows: 2 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1 
#>  - Genes: 2
#> Sorting rows.
#> Prioritised targets: step='end' 
#>  - Rows: 2 
#>  - Phenotypes: 1 
#>  - Diseases: 1 
#>  - Cell types: 1 
#>  - Genes: 2