Run Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP).
run_umap(
  mat,
  transpose = TRUE,
  pca = ncol(mat),
  add_names = TRUE,
  n_components = 2,
  n_neighbors = 15,
  min_dist = 0.01,
  metric = "euclidean",
  init = "spectral",
  seed = 2020,
  verbose = TRUE,
  ...
)
Arguments
mat |
Matrix to run UMAP on. |
transpose |
Whether to transpose the matrix first. |
pca |
If set to a positive integer value, reduce data to this number of
columns using PCA. Not applied if the distance metric is
"hamming", or if the dimensions of the data are not larger than the
number specified (i.e. the number of rows and columns must both be larger than the
value of this parameter). If you have > 100 columns in a data frame or
matrix, reducing the number of columns in this way may substantially
increase the performance of the nearest neighbor search at the cost of a
potential decrease in accuracy. In many t-SNE applications, a value of 50
is recommended, although there's no guarantee that this is appropriate for
all settings. |
add_names |
Add colnames and rownames to embeddings and loadings. |
n_components |
The dimension of the space to embed into. This defaults
to 2 to provide easy visualization, but can reasonably be set to any
integer value in the range 2 to 100. |
n_neighbors |
The size of local neighborhood (in terms of number of
neighboring sample points) used for manifold approximation. Larger values
result in more global views of the manifold, while smaller values result in
more local data being preserved. In general, values should be in the range
2 to 100. |
min_dist |
The effective minimum distance between embedded points.
Smaller values will result in a more clustered/clumped embedding where
nearby points on the manifold are drawn closer together, while larger
values will result in a more even dispersal of points. The value should be
set relative to the spread value, which determines the scale at
which embedded points will be spread out. |
metric |
Type of distance metric to use to find nearest neighbors. One
of:
"euclidean" (the default)
"cosine"
"manhattan"
"hamming"
"correlation" (a distance based on the Pearson correlation)
"categorical" (see below)
Only applies if nn_method = "annoy" (for nn_method = "fnn", the
distance metric is always "euclidean").
If X is a data frame or matrix, then multiple metrics can be
specified, by passing a list to this argument, where the name of each item in
the list is one of the metric names above. The value of each list item should
be a vector giving the names or integer ids of the columns to be included in
a calculation, e.g. metric = list(euclidean = 1:4, manhattan = 5:10).
Each metric calculation results in a separate fuzzy simplicial set, and these sets are
intersected together to produce the final set. Metric names can be repeated.
Because non-numeric columns are removed from the data frame, it is safer to
use column names than integer ids.
Factor columns can also be used by specifying the metric name
"categorical". Factor columns are treated differently from numeric
columns and although multiple factor columns can be specified in a vector,
each factor column specified is processed individually. If you specify
a non-factor column, it will be coerced to a factor.
For a given data block, you may override the pca and pca_center
arguments for that block, by providing a list with one unnamed item
containing the column names or ids, and then any of the pca or
pca_center overrides as named items, e.g. metric =
list(euclidean = 1:4, manhattan = list(5:10, pca_center = FALSE)). This
exists to allow mixed binary and real-valued data to be included and to have
PCA applied to both, but with centering applied only to the real-valued data
(it is typical not to apply centering to binary data before PCA is applied). |
init |
Type of initialization for the coordinates. Options are:
"spectral". Spectral embedding using the normalized Laplacian
of the fuzzy 1-skeleton, with Gaussian noise added.
"normlaplacian". Spectral embedding using the normalized
Laplacian of the fuzzy 1-skeleton, without noise.
"random". Coordinates assigned using a uniform random
distribution between -10 and 10.
"lvrandom". Coordinates assigned using a Gaussian
distribution with standard deviation 1e-4, as used in LargeVis
(Tang et al., 2016) and t-SNE.
"laplacian". Spectral embedding using the Laplacian Eigenmap
(Belkin and Niyogi, 2002).
"pca". The first two principal components from PCA of
X if X is a data frame, and from a 2-dimensional classical
MDS if X is of class "dist".
"spca". Like "pca", but each dimension is then scaled
so the standard deviation is 1e-4, to give a distribution similar to that
used in t-SNE. This is an alias for init = "pca", init_sdev =
1e-4.
"agspectral". An "approximate global" modification of
"spectral" which sets all edges in the graph to a value of 1, and then
sets a random number of edges (negative_sample_rate edges per
vertex) to 0.1, to approximate the effect of non-local affinities.
A matrix of initial coordinates.
For spectral initializations ("spectral", "normlaplacian",
"laplacian"), if more than one connected component is identified,
each connected component is initialized separately and the results are
merged. If verbose = TRUE, the number of connected components is
logged to the console. The existence of multiple connected components
implies that a global view of the data cannot be attained with this
initialization. Either a PCA-based initialization or increasing the value of
n_neighbors may be more appropriate. |
seed |
Seed passed to set.seed for reproducibility between runs. |
verbose |
If TRUE, log details to the console. |
... |
Additional parameters passed to umap. |
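Two of the arguments above accept structured values rather than a single string: metric can be a named list of column blocks, and init can be a matrix of starting coordinates. The base R sketch below only builds values of those shapes, following the descriptions above. The data, the column names f1 to f10, and all object names are made up for illustration, and the use of stats::prcomp for the "spca"-like matrix is an assumption, not necessarily what umap does internally.

```r
# Illustrative data: 100 rows, 10 numeric columns (names are arbitrary).
set.seed(2020)
n <- 100
X <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
names(X) <- paste0("f", 1:10)

# Per-block metrics: each item is named after a metric and holds column names.
# The second block is itself a list: one unnamed item (the columns) followed
# by a named override, mirroring the pca_center example in the text.
metric_blocks <- list(
  euclidean = paste0("f", 1:4),
  manhattan = list(paste0("f", 5:10), pca_center = FALSE)
)

n_components <- 2

# Like "random": uniform coordinates between -10 and 10.
init_random <- matrix(runif(n * n_components, min = -10, max = 10),
                      ncol = n_components)

# Like "lvrandom": Gaussian coordinates with standard deviation 1e-4.
init_lvrandom <- matrix(rnorm(n * n_components, sd = 1e-4),
                        ncol = n_components)

# Like "spca": leading principal components, each rescaled to sd 1e-4.
pcs <- stats::prcomp(X, rank. = n_components)$x
init_spca <- sweep(pcs, 2, apply(pcs, 2, stats::sd), "/") * 1e-4
```

Any of the three matrices has one row per observation and n_components columns, which is the shape a matrix of initial coordinates needs. Column names are used in metric_blocks rather than integer ids, following the advice above about non-numeric columns being removed.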
Source
UMAP documentation
Details
Uses umap, but runs and returns PCA by default.
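The conditions documented for the pca argument can be sketched in base R. Here maybe_pca is a hypothetical helper written for this illustration; it is not part of this package or of umap, and stats::prcomp stands in for whatever PCA routine is actually used.

```r
# Hypothetical helper: reduce X to `pca` columns only when the documented
# conditions hold (metric is not "hamming", and both the number of rows and
# the number of columns exceed `pca`). Otherwise return X unchanged.
maybe_pca <- function(X, pca = NULL, metric = "euclidean") {
  apply_pca <- !is.null(pca) &&
    metric != "hamming" &&
    nrow(X) > pca &&
    ncol(X) > pca
  if (!apply_pca) {
    return(X)
  }
  # Keep the scores of the first `pca` principal components.
  stats::prcomp(X, center = TRUE, scale. = FALSE, rank. = pca)$x
}

set.seed(2020)
X <- matrix(rnorm(200 * 120), nrow = 200, ncol = 120)

reduced <- maybe_pca(X, pca = 50)     # 120 columns reduced to 50
unchanged <- maybe_pca(X, pca = 150)  # 150 exceeds ncol(X), so no PCA
dim(reduced)
dim(unchanged)
```

With more than 100 input columns, a reduction like this is what the pca description suggests for speeding up the nearest neighbor search, at some potential cost in accuracy.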