Currently supports ortholog mapping between any pair of 700+ species.
Use map_species to return a full list of available organisms.
convert_orthologs(
  gene_df,
  gene_input = "rownames",
  gene_output = "rownames",
  standardise_genes = FALSE,
  input_species,
  output_species = "human",
  method = c("gprofiler", "homologene", "babelgene"),
  drop_nonorths = TRUE,
  non121_strategy = "drop_both_species",
  agg_fun = NULL,
  mthreshold = Inf,
  as_sparse = FALSE,
  as_DelayedArray = FALSE,
  sort_rows = FALSE,
  gene_map = NULL,
  input_col = "input_gene",
  output_col = "ortholog_gene",
  verbose = TRUE,
  ...
)
Data object containing the genes
(see gene_input for options on how
the genes can be stored within the object).
Can be one of the following formats:
matrix: A sparse or dense matrix.
data.frame: A data.frame, data.table, or tibble.
list: A list or character vector.
Genes, transcripts, proteins, SNPs, or genomic ranges
can be provided in any format
(HGNC, Ensembl, RefSeq, UniProt, etc.) and will be
automatically converted to gene symbols unless
specified otherwise with the ... arguments.
Note: If you set method="homologene", you
must either supply genes in gene symbol format (e.g. "Sox2")
OR set standardise_genes=TRUE.
Which aspect of gene_df to get gene names from:
"rownames": From row names of data.frame/matrix.
"colnames": From column names of data.frame/matrix.
<column name>: From a column in gene_df, e.g. "gene_names".
How to return genes.
Options include:
"rownames": As row names of gene_df.
"colnames": As column names of gene_df.
"columns": As new columns "input_gene", "ortholog_gene"
(and "input_gene_standard" if standardise_genes=TRUE) in gene_df.
"dict": As a dictionary (named list) where the names
are input_gene and the values are ortholog_gene.
"dict_rev": As a reversed dictionary (named list)
where the names are ortholog_gene and the values are input_gene.
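The "dict" and "dict_rev" shapes can be pictured with a toy mapping in base R. This is only an illustrative sketch: `toy_map` and its three gene pairs are hand-written for the example, and the exact object class that convert_orthologs returns should be checked against the package itself.

```r
# Toy mouse-to-human mapping, written by hand for illustration only.
toy_map <- data.frame(
  input_gene    = c("Sox2", "Gfap", "Snap25"),
  ortholog_gene = c("SOX2", "GFAP", "SNAP25"),
  stringsAsFactors = FALSE
)

# "dict": names are input_gene, values are ortholog_gene.
dict <- stats::setNames(as.list(toy_map$ortholog_gene), toy_map$input_gene)

# "dict_rev": names are ortholog_gene, values are input_gene.
dict_rev <- stats::setNames(as.list(toy_map$input_gene), toy_map$ortholog_gene)

dict[["Sox2"]]      # look up the ortholog of an input gene
dict_rev[["GFAP"]]  # look up the input gene behind an ortholog
```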
If TRUE AND gene_output="columns", a new column "input_gene_standard"
will be added to gene_df containing standardised HGNC symbols
identified by gorth.
Name of the input species (e.g., "mouse","fly"). Use map_species to return a full list of available species.
Name of the output species (e.g. "human","chicken"). Use map_species to return a full list of available species.
R package to use for gene mapping:
"gprofiler": Slower but more species and genes.
"homologene": Faster but fewer species and genes.
"babelgene": Faster but fewer species and genes.
Also gives consensus scores for each gene mapping based on
several different data sources.
Drop genes that don't have an ortholog in the output_species.
How to handle genes that don't have
1:1 mappings between input_species:output_species.
Options include:
"drop_both_species" or "dbs" or 1: Drop genes that have duplicate
mappings in either the input_species or output_species (DEFAULT).
"drop_input_species" or "dis" or 2: Only drop genes that have duplicate
mappings in the input_species.
"drop_output_species" or "dos" or 3: Only drop genes that have duplicate
mappings in the output_species.
"keep_both_species" or "kbs" or 4: Keep all genes regardless of whether
they have duplicate mappings in either species.
"keep_popular" or "kp" or 5: Return only the most "popular" interspecies ortholog mappings.
This procedure tends to yield a greater number of returned genes
but at the cost of many of them not being true biological 1:1 orthologs.
"sum","mean","median","min" or "max": When gene_df is a matrix and gene_output="rownames",
these options will aggregate many-to-one gene mappings
(input_species-to-output_species)
after dropping any duplicate genes in the output_species.
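The four drop/keep strategies can be illustrated on a toy gene map in base R. This is a sketch under one stated assumption: "duplicate mappings in the input_species" is read as input genes that appear more than once in the map (1:many), and likewise for the output side (many:1). The `toy_map` table is hypothetical, and the code is not taken from the package.

```r
# Toy gene map, written by hand for illustration only.
# A -> a is 1:1, B -> b1/b2 is 1:many, C and D -> cd is many:1.
toy_map <- data.frame(
  input_gene    = c("A", "B", "B", "C", "D"),
  ortholog_gene = c("a", "b1", "b2", "cd", "cd"),
  stringsAsFactors = FALSE
)

# Flag every row whose gene appears more than once on a given side.
dup_input <- toy_map$input_gene %in%
  toy_map$input_gene[duplicated(toy_map$input_gene)]
dup_output <- toy_map$ortholog_gene %in%
  toy_map$ortholog_gene[duplicated(toy_map$ortholog_gene)]

drop_both_species   <- toy_map[!dup_input & !dup_output, ]
drop_input_species  <- toy_map[!dup_input, ]
drop_output_species <- toy_map[!dup_output, ]
keep_both_species   <- toy_map

# Number of mappings retained under each strategy.
sapply(
  list(dbs = drop_both_species, dis = drop_input_species,
       dos = drop_output_species, kbs = keep_both_species),
  nrow
)
```

Only the A-to-a pair survives the default strategy, which is why "drop_both_species" is the most conservative choice.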
Aggregation function passed to aggregate_mapped_genes.
Set to NULL to skip the aggregation step (default).
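What aggregating many-to-one mappings means can be shown with base R alone. The sketch below is not how aggregate_mapped_genes is implemented; the `expr` matrix and the `ortholog` lookup vector are made-up names and values used only to show rows that share an ortholog being collapsed by sum or mean.

```r
# Toy expression matrix: 3 input genes x 2 samples (values are made up).
# matrix() fills column-wise, so A = (1, 2), C = (3, 4), D = (5, 6).
expr <- matrix(
  c(1, 3, 5,
    2, 4, 6),
  nrow = 3,
  dimnames = list(c("A", "C", "D"), c("sample1", "sample2"))
)

# Hypothetical many-to-one map: C and D share the ortholog "cd".
ortholog <- c(A = "a", C = "cd", D = "cd")
groups <- unname(ortholog[rownames(expr)])

# Sum rows that share an ortholog; base::rowsum() groups rows by label.
agg_sum <- rowsum(expr, group = groups)

# Mean of rows that share an ortholog, via stats::aggregate().
agg_mean <- stats::aggregate(
  as.data.frame(expr),
  by = list(ortholog_gene = groups),
  FUN = mean
)
```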
Maximum number of ortholog names per gene to show.
Passed to gorth.
Only used when method="gprofiler" (DEFAULT: Inf).
Convert gene_df to a sparse matrix.
Only works if gene_df is one of the following classes:
matrix
Matrix
data.frame
data.table
tibble
If gene_df is a sparse matrix to begin with,
it will be returned as a sparse matrix
(so long as gene_output="rownames" or "colnames").
Convert aggregated matrix to DelayedArray.
Sort gene_df rows alphanumerically.
A data.frame that maps the current gene names to new gene names. This function's behaviour will adapt to different situations as follows:
gene_map=<data.frame>: When a data.frame containing the
gene key:value columns
(specified by input_col and output_col, respectively)
is provided, this will be used to perform aggregation/expansion.
gene_map=NULL and input_species!=output_species: A gene_map
is automatically generated by
map_orthologs to perform inter-species
gene aggregation/expansion.
gene_map=NULL and input_species==output_species: A gene_map
is automatically generated by
map_genes to perform within-species
gene symbol standardization and aggregation/expansion.
Column name within gene_map with gene names matching
the row names of X.
Column name within gene_map with gene names
that you wish to map the row names of X onto.
Print messages.
Additional arguments to be passed to
gorth or homologene.
NOTE: To return only the most "popular"
interspecies ortholog mappings,
supply mthreshold=1 here AND set method="gprofiler" above.
This procedure tends to yield a greater number of returned genes but at
the cost of many of them not being true biological 1:1 orthologs.
For more details, please see
here.
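The note above combines two settings, which could be bundled as in the following sketch. `convert_popular` is a hypothetical convenience wrapper, not part of the package. It uses only arguments listed in the usage signature, and calling it assumes the orthogene package is installed and that the gprofiler backend is reachable.

```r
# Hypothetical convenience wrapper (not part of orthogene).
# It only fixes the two settings named in the note above.
convert_popular <- function(gene_df, input_species, output_species = "human") {
  orthogene::convert_orthologs(
    gene_df = gene_df,
    input_species = input_species,
    output_species = output_species,
    method = "gprofiler",  # mthreshold is only used with this method
    mthreshold = 1         # at most one ortholog name per gene
  )
}

# Usage, mirroring the package's own example data:
# data("exp_mouse", package = "orthogene")
# gene_df <- convert_popular(exp_mouse, input_species = "mouse")
```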
gene_df with orthologs converted to the output_species.
Instead returned as a dictionary (named list) if
gene_output="dict" or "dict_rev".
data("exp_mouse")
gene_df <- convert_orthologs(
gene_df = exp_mouse,
input_species = "mouse"
)
#> Preparing gene_df.
#> sparseMatrix format detected.
#> Extracting genes from rownames.
#> 15,259 genes extracted.
#> Converting mouse ==> human orthologs using: gprofiler
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: mouse
#> Common name mapping found for mouse
#> 1 organism identified from search: mmusculus
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: human
#> Common name mapping found for human
#> 1 organism identified from search: hsapiens
#> Checking for genes without orthologs in human.
#> Extracting genes from input_gene.
#> 15,690 genes extracted.
#> Extracting genes from ortholog_gene.
#> 15,690 genes extracted.
#> Dropping 2,512 NAs of all kinds from ortholog_gene.
#> Checking for genes without 1:1 orthologs.
#> Dropping 285 genes that have multiple input_gene per ortholog_gene (many:1).
#> Dropping 215 genes that have multiple ortholog_gene per input_gene (1:many).
#> Filtering gene_df with gene_map
#> Setting ortholog_gene to rownames.
#>
#> =========== REPORT SUMMARY ===========
#> Total genes dropped after convert_orthologs :
#> 2,834 / 15,259 (19%)
#> Total genes remaining after convert_orthologs :
#> 12,425 / 15,259 (81%)