vignettes/orthogene.Rmd
orthogene.Rmd
orthogene
: Interspecies gene mapping
orthogene
is an R package for easy mapping of
orthologous genes across hundreds of species.
It pulls up-to-date interspecies gene ortholog mappings across 700+
organisms.
It also provides various utility functions to map common objects (e.g. data.frames, gene expression matrices, lists) onto 1:1 gene orthologs from any other species.
In brief, orthogene
lets you easily:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# orthogene is only available on Bioconductor>=3.14
if(BiocManager::version()<"3.14") BiocManager::install(version = "3.14")
BiocManager::install("orthogene")
convert_orthologs
is very flexible with what users can supply as gene_df
, and
can take a
data.frame
/data.table
/tibble
,
(sparse) matrix
, or list
/vector
containing genes.
Genes, transcripts, proteins, SNPs, or genomic ranges will be recognised in most formats (HGNC, Ensembl, RefSeq, UniProt, etc.) and can even be a mixture of different formats.
All genes will be mapped to gene symbols, unless specified otherwise
with the ...
arguments (see
?orthogene::convert_orthologs
or here
for details).
A key feature of convert_orthologs
is that it handles the issue of genes with many-to-many mappings across
species. This can occur due to evolutionary divergence, and the function
of these genes tends to be less conserved and less translatable. Users
can address this using different strategies via
non121_strategy=
:
"drop_both_species"
: Drop genes that have duplicate
mappings in either the input_species or output_species,
(DEFAULT)."drop_input_species"
: Only drop genes that have
duplicate mappings in input_species
."drop_output_species"
: Only drop genes that have
duplicate mappings in the output_species
."keep_both_species"
: Keep all genes regardless of
whether they have duplicate mappings in either species."keep_popular"
: Return only the most “popular”
interspecies ortholog mappings. This procedure tends to yield a greater
number of returned genes but at the cost of many of them not being true
biological 1:1 orthologs.When gene_df
is a matrix. These strategies can be used
together with agg_fun
. This feature automatically performs
both ortholog aggregation (many:1 mappings) and expansion (1:many
mappings) of matrices, depending on the situation. This means that you
have the option to keep non-1:1 ortholog genes, and still produce a
matrix with only 1 gene per row. Options include: 1. "sum"
2. "mean"
3. "median"
4. "min"
5.
"max"
For more information on how orthogene
performs matrix
aggregation/expansion, see the documentation for the underlying
function: ?orthogene:::many2many_rows
gene_df <- orthogene::convert_orthologs(gene_df = exp_mouse,
gene_input = "rownames",
gene_output = "rownames",
input_species = "mouse",
output_species = "human",
non121_strategy = "drop_both_species",
method = method)
## Preparing gene_df.
## sparseMatrix format detected.
## Extracting genes from rownames.
## 15,259 genes extracted.
## Converting mouse ==> human orthologs using: homologene
## Retrieving all organisms available in homologene.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: 10090
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Checking for genes without orthologs in human.
## Extracting genes from input_gene.
## 13,416 genes extracted.
## Extracting genes from ortholog_gene.
## 13,416 genes extracted.
## Checking for genes without 1:1 orthologs.
## Dropping 46 genes that have multiple input_gene per ortholog_gene (many:1).
## Dropping 56 genes that have multiple ortholog_gene per input_gene (1:many).
## Filtering gene_df with gene_map
## Setting ortholog_gene to rownames.
##
## =========== REPORT SUMMARY ===========
## Total genes dropped after convert_orthologs :
## 2,016 / 15,259 (13%)
## Total genes remaining after convert_orthologs :
## 13,243 / 15,259 (87%)
astrocytes_ependymal | endothelial-mural | interneurons | microglia | oligodendrocytes | pyramidal CA1 | pyramidal SS | |
---|---|---|---|---|---|---|---|
TSPAN12 | 0.3303571 | 0.5872340 | 0.6413793 | 0.1428571 | 0.1207317 | 0.2864750 | 0.1453634 |
TSHZ1 | 0.4285714 | 0.4468085 | 1.1551724 | 0.4387755 | 0.3621951 | 0.0692226 | 0.8320802 |
ADAMTS15 | 0.0089286 | 0.0978723 | 0.2206897 | 0.0000000 | 0.0231707 | 0.0117146 | 0.0375940 |
CLDN12 | 0.2232143 | 0.1148936 | 0.5517241 | 0.0510204 | 0.2609756 | 0.4376997 | 0.6842105 |
RXFP1 | 0.0000000 | 0.0127660 | 0.2551724 | 0.0000000 | 0.0158537 | 0.0511182 | 0.0751880 |
SEMA3C | 0.1964286 | 0.9957447 | 8.6379310 | 0.2040816 | 0.1853659 | 0.1608094 | 0.2280702 |
map_species
lets you standardise species names from a wide variety of identifiers
(e.g. common name, taxonomy ID, Latin name, partial match).
All exposed orthogene
functions (including convert_orthologs
)
use map_species
under the hood, so you don’t have to worry
about getting species names exactly right.
You can check the full list of available species by simply running
map_species()
with no arguments, or checking here.
species <- orthogene::map_species(species = c("human",9544,"mus musculus",
"fruit fly","Celegans"),
output_format = "scientific_name")
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: Homo sapiens
## Mapping species name: 9544
## 1 organism identified from search: Macaca mulatta
## Mapping species name: mus musculus
## 1 organism identified from search: Mus musculus
## Mapping species name: fruit fly
## Common name mapping found for fruit fly
## 1 organism identified from search: Drosophila melanogaster
## Mapping species name: Celegans
## 1 organism identified from search: Caenorhabditis elegans
print(species)
## human 9544 mus musculus
## "Homo sapiens" "Macaca mulatta" "Mus musculus"
## fruit fly Celegans
## "Drosophila melanogaster" "Caenorhabditis elegans"
It may be helpful to know the maximum expected number of orthologous gene mappings from one species to another.
ortholog_report
generates a report that tells you this information genome-wide.
orth_zeb <- orthogene::report_orthologs(target_species = "zebrafish",
reference_species = "human",
method_all_genes = method,
method_convert_orthologs = method)
## Gathering ortholog reports.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Gene table with 19,129 rows retrieved.
## Returning all 19,129 genes from human.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: zebrafish
## Common name mapping found for zebrafish
## 1 organism identified from search: 7955
## Gene table with 20,897 rows retrieved.
## Returning all 20,897 genes from zebrafish.
## --
## --
## Preparing gene_df.
## data.frame format detected.
## Extracting genes from Gene.Symbol.
## 20,897 genes extracted.
## Converting zebrafish ==> human orthologs using: homologene
## Retrieving all organisms available in homologene.
## Mapping species name: zebrafish
## Common name mapping found for zebrafish
## 1 organism identified from search: 7955
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Checking for genes without orthologs in human.
## Extracting genes from input_gene.
## 14,768 genes extracted.
## Extracting genes from ortholog_gene.
## 14,768 genes extracted.
## Checking for genes without 1:1 orthologs.
## Dropping 46 genes that have multiple input_gene per ortholog_gene (many:1).
## Dropping 2,707 genes that have multiple ortholog_gene per input_gene (1:many).
## Filtering gene_df with gene_map
## Adding input_gene col to gene_df.
## Adding ortholog_gene col to gene_df.
##
## =========== REPORT SUMMARY ===========
## Total genes dropped after convert_orthologs :
## 10,336 / 20,895 (49%)
## Total genes remaining after convert_orthologs :
## 10,559 / 20,895 (51%)
## --
##
## =========== REPORT SUMMARY ===========
## 10,557 / 20,895 (50.52%) target_species genes remain after ortholog conversion.
## 10,557 / 19,129 (55.19%) reference_species genes remain after ortholog conversion.
input_species | target_species | reference_species | input_gene | ortholog_gene |
---|---|---|---|---|
zebrafish | danio rerio | homo sapiens | acadm | ACADM |
zebrafish | danio rerio | homo sapiens | acadvl | ACADVL |
zebrafish | danio rerio | homo sapiens | acat1 | ACAT1 |
zebrafish | danio rerio | homo sapiens | acvr1l | ACVR1 |
zebrafish | danio rerio | homo sapiens | adsl | ADSL |
zebrafish | danio rerio | homo sapiens | aga | AGA |
knitr::kable(orth_zeb$report)
input_species | target_species | target_total_genes | reference_species | reference_total_genes | one2one_orthologs | target_percent | reference_percent |
---|---|---|---|---|---|---|---|
zebrafish | danio rerio | 20895 | homo sapiens | 19129 | 10557 | 50.52 | 55.19 |
map_genes
finds matching within-species synonyms across a wide variety of
gene naming conventions (HGNC, Ensembl, RefSeq, UniProt, etc.) and
returns a table with standardised gene symbols (or whatever output
format you prefer).
genes <- c("Klf4", "Sox2", "TSPAN12","NM_173007","Q8BKT6",9999,
"ENSMUSG00000012396","ENSMUSG00000074637")
mapped_genes <- orthogene::map_genes(genes = genes,
species = "mouse",
drop_na = FALSE)
## Retrieving all organisms available in gprofiler.
## Using stored `gprofiler_orgs`.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: mmusculus
## 7 / 8 (87.5%) genes mapped.
input_number | input | target_number | target | name | description | namespace |
---|---|---|---|---|---|---|
1 | Klf4 | 1.1 | ENSMUSG00000003032 | Klf4 | Kruppel-like factor 4 (gut) [Source:MGI Symbol;Acc:MGI:1342287] | ENTREZGENE,MGI,UNIPROT_GN,WIKIGENE |
2 | Sox2 | 2.1 | ENSMUSG00000074637 | Sox2 | SRY (sex determining region Y)-box 2 [Source:MGI Symbol;Acc:MGI:98364] | ENTREZGENE,MGI,UNIPROT_GN,WIKIGENE |
3 | TSPAN12 | 3.1 | ENSMUSG00000029669 | Tspan12 | tetraspanin 12 [Source:MGI Symbol;Acc:MGI:1889818] | ENTREZGENE,MGI,UNIPROT_GN,WIKIGENE |
4 | NM_173007 | 4.1 | ENSMUSG00000029669 | Tspan12 | tetraspanin 12 [Source:MGI Symbol;Acc:MGI:1889818] | REFSEQ_MRNA_ACC |
5 | Q8BKT6 | 5.1 | ENSMUSG00000029669 | Tspan12 | tetraspanin 12 [Source:MGI Symbol;Acc:MGI:1889818] | UNIPROTSWISSPROT_ACC,UNIPROT_GN_ACC |
6 | 9999 | 6.1 | NA | NA | NA |
aggregate_mapped_genes
does the following:
map_genes
to identify within-species
many-to-one gene mappings (e.g. Ensembl transcript IDs ==> gene
symbols). Alternatively, can map across species if output from
map_orthologs
is supplied to gene_map
argument
(and gene_map_col="ortholog_gene"
).gene_df
using
"sum"
,"mean"
,"median"
,"min"
or "max"
.Note, this only works when the input data (gene_df
) is a
sparse or dense matrix, and the genes are row names.
astrocytes_ependymal | endothelial-mural | interneurons | microglia | oligodendrocytes | pyramidal CA1 | pyramidal SS | |
---|---|---|---|---|---|---|---|
ENSMUST00000102875 | 2.8258910 | 0.4041560 | 1.3171987 | 0.3774840 | 1.3426606 | 1.0403481 | 1.0876508 |
ENSMUST00000133343 | 2.8259032 | 0.4042189 | 1.3171312 | 0.3774038 | 1.3425772 | 1.0403432 | 1.0876385 |
ENSMUST00000143890 | 2.8258554 | 0.4041963 | 1.3171145 | 0.3774192 | 1.3426119 | 1.0403496 | 1.0876334 |
ENSMUST00000005053 | 0.4597978 | 0.3403299 | 0.9067953 | 0.2958589 | 0.7254482 | 0.4813420 | 0.7418000 |
ENSMUST00000185896 | 0.4596631 | 0.3403637 | 0.9067538 | 0.2958896 | 0.7255006 | 0.4812783 | 0.7417918 |
ENSMUST00000188282 | 0.4597399 | 0.3403441 | 0.9067727 | 0.2957819 | 0.7255681 | 0.4811978 | 0.7417924 |
exp_agg <- orthogene::aggregate_mapped_genes(gene_df=exp_mouse_enst,
input_species="mouse",
agg_fun = "sum")
## Retrieving all organisms available in gprofiler.
## Using stored `gprofiler_orgs`.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: mmusculus
## 482 / 482 (100%) genes mapped.
## Aggregating rows using: monocle3
## Converting obj to sparseMatrix.
## Matrix aggregated:
## - Input: 482 x 7
## - Output: 92 x 7
astrocytes_ependymal | endothelial-mural | interneurons | microglia | oligodendrocytes | pyramidal CA1 | pyramidal SS | |
---|---|---|---|---|---|---|---|
Tshz1 | 1.713936 | 1.7869498 | 4.620366 | 1.7545487 | 1.4483505 | 0.2764327 | 3.3280256 |
Tspan12 | 1.981594 | 3.5228954 | 3.847706 | 0.8565873 | 0.7237384 | 1.7184690 | 0.8716624 |
Ugp2 | 11.303434 | 1.6167531 | 5.268556 | 1.5096830 | 5.3705113 | 4.1614945 | 4.3505804 |
Usp28 | 1.561545 | 1.4885072 | 12.481956 | 0.9176950 | 1.0237324 | 5.5261972 | 6.4652509 |
Vat1l | 0.178117 | 0.0337314 | 1.199619 | 0.0812187 | 0.1165772 | 0.2339571 | 0.4006268 |
Wtap | 2.691383 | 2.4118074 | 12.289111 | 3.5809075 | 3.2808114 | 9.3443456 | 8.6384533 |
You can also quickly get all known genes from the genome of a given
species with all_genes
.
genome_mouse <- orthogene::all_genes(species = "mouse",
method = method)
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: 10090
## Gene table with 21,207 rows retrieved.
## Returning all 21,207 genes from mouse.
HID | Gene.ID | Gene.Symbol | taxonomy_id | |
---|---|---|---|---|
6 | 3 | 11364 | Acadm | 10090 |
18 | 5 | 11370 | Acadvl | 10090 |
29 | 6 | 110446 | Acat1 | 10090 |
52 | 7 | 11477 | Acvr1 | 10090 |
64 | 9 | 20391 | Sgca | 10090 |
71 | 12 | 11564 | Adsl | 10090 |
utils::sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] orthogene_1.9.1 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.4 babelgene_22.9
## [3] xfun_0.41 bslib_0.6.1
## [5] ggplot2_3.4.4 htmlwidgets_1.6.4
## [7] rstatix_0.7.2 lattice_0.22-5
## [9] bitops_1.0-7 vctrs_0.6.5
## [11] tools_4.3.2 generics_0.1.3
## [13] yulab.utils_0.1.1 parallel_4.3.2
## [15] tibble_3.2.1 fansi_1.0.6
## [17] pkgconfig_2.0.3 Matrix_1.6-1.1
## [19] ggplotify_0.1.2 data.table_1.14.10
## [21] homologene_1.4.68.19.3.27 desc_1.4.3
## [23] lifecycle_1.0.4 compiler_4.3.2
## [25] stringr_1.5.1 treeio_1.26.0
## [27] textshaping_0.3.7 munsell_0.5.0
## [29] carData_3.0-5 ggtree_3.10.0
## [31] gprofiler2_0.2.2 ggfun_0.1.3
## [33] htmltools_0.5.7 sass_0.4.8
## [35] RCurl_1.98-1.13 yaml_2.3.8
## [37] lazyeval_0.2.2 plotly_4.10.3
## [39] pillar_1.9.0 pkgdown_2.0.7
## [41] car_3.1-2 ggpubr_0.6.0
## [43] jquerylib_0.1.4 tidyr_1.3.0
## [45] cachem_1.0.8 grr_0.9.5
## [47] abind_1.4-5 nlme_3.1-163
## [49] tidyselect_1.2.0 aplot_0.2.2
## [51] digest_0.6.33 stringi_1.8.3
## [53] dplyr_1.1.4 purrr_1.0.2
## [55] bookdown_0.37 fastmap_1.1.1
## [57] grid_4.3.2 colorspace_2.1-0
## [59] cli_3.6.2 magrittr_2.0.3
## [61] patchwork_1.1.3 utf8_1.2.4
## [63] broom_1.0.5 ape_5.7-1
## [65] withr_2.5.2 scales_1.3.0
## [67] backports_1.4.1 httr_1.4.7
## [69] rmarkdown_2.25 ggsignif_0.6.4
## [71] ragg_1.2.6 memoise_2.0.1
## [73] evaluate_0.23 knitr_1.45
## [75] viridisLite_0.4.2 gridGraphics_0.5-1
## [77] rlang_1.1.2 Rcpp_1.0.11
## [79] glue_1.6.2 tidytree_0.4.6
## [81] BiocManager_1.30.22 jsonlite_1.8.8
## [83] R6_2.5.1 systemfonts_1.0.5
## [85] fs_1.6.3