vignettes/tf-idf.Rmd
Term frequency-inverse document frequency (tf-idf) is an NLP technique to identify words or phrases that are enriched in one document relative to a larger set of documents.
In our case, the words come from the non-standardized cell labels and the “documents” are the clusters. The goal is to find words that are enriched in each cluster relative to all the other clusters. This can be thought of as an NLP equivalent of finding gene markers for each cluster.
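To make the idea concrete, here is a minimal base R sketch of tf-idf on a handful of made-up cell labels. The labels, cluster IDs, and variable names are invented for illustration, and the natural-log idf used here is one common convention rather than something taken from the scNLP source.

```r
# Made-up free-text cell labels and cluster assignments (illustration only)
labels <- c("excitatory neuron", "excitatory pyramidal neuron",
            "radial glia", "schwann glia",
            "inhibitory neuron", "vascular pericytes")
cluster <- c("0", "0", "1", "1", "2", "3")

# Split each label into lowercase words and record the cluster of each word
words <- strsplit(tolower(labels), "[^a-z0-9]+")
tokens <- data.frame(
  cluster = rep(cluster, lengths(words)),
  word = unlist(words)
)

# Word-by-cluster count matrix: each cluster acts as one "document"
counts <- unclass(table(tokens$word, tokens$cluster))

# tf: fraction of a cluster's words accounted for by each word
tf <- sweep(counts, 2, colSums(counts), "/")
# idf: log(number of clusters / number of clusters containing the word)
idf <- log(ncol(counts) / rowSums(counts > 0))
# tf-idf: idf has one value per row, so it recycles down each column
tf_idf <- tf * idf

# Highest-scoring word in each cluster
top_word <- apply(tf_idf, 2, function(x) names(which.max(x)))
print(round(tf_idf, 3))
print(top_word)
```

In this toy example "neuron" occurs in two clusters, so its idf is lower, and "excitatory" ends up as the top word for cluster 0 even though both words are equally frequent there.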
If you don’t already have a Seurat object with reduced
dimensions and cluster assignments, you can generate a new one with the
following support function.
## Create a fresh Seurat object from raw data
counts <- Seurat::GetAssayData(pseudo_seurat, layer = "counts")
meta.data <- pseudo_seurat[[]]
## Create Seurat object with metadata, then run pipeline
obj <- SeuratObject::CreateSeuratObject(
counts = counts,
meta.data = meta.data
)
processed_seurat <- seurat_pipeline(obj = obj)
## Running NormalizeData...
## Normalizing layer: counts
## Running FindVariableFeatures...
## Finding variable features for layer counts
## Running ScaleData...
## Centering and scaling data matrix
## Running PCA...
## PC_ 1
## Positive: FTMT, ACTG1, ALAS2, HSPA1L, NME1-NME2, POTEI, OTOP2, RAPSN, BEST2, DPEP2
## HMGB2, NAPRT1, KRT8, PPIB, DSCC1, POU4F3, CCDC102A, GAPDHS, CHST6, AGXT2
## LYZL2, MTMR8, ACTG2, ACPL2, BANF1, PPAPDC2, HTR1A, IFI30, CYBRD1, LHX8
## Negative: FAIM2, CAMK2A, SCN1A, CAMK2B, FRRS1L, UNC80, PHYHIP, RASGRF2, CCK, GRIA2
## STXBP5L, ARPP21, SLC12A5, DIRAS2, RYR2, SLC4A10, KCNT1, GRM5, CAMKV, KIAA1211L
## GABRA4, GABRA1, SV2B, CX3CL1, AK5, PNMA2, JPH4, DGKG, GPR158, KCNC2
## PC_ 2
## Positive: CAMKK1, DGKQ, NT5DC3, CA7, ABCG4, HTR1A, C5orf28, OTOP2, HYKK, DPEP2
## CHST6, POTEI, SLC8A3, SLC38A11, ADRA2A, MPPED1, MTMR8, HTR7, CACNA1B, PPAPDC2
## C2orf69, GRIK1, IFI30, STK32B, RASL10B, SLC24A4, FAXDC2, ADCY3, ACSS2, ANKRD29
## Negative: RAN, HSP90AA1, H2AFZ, HNRNPAB, CCT5, NPM1, GNG5, DBI, HMGB2, ITM2B
## ATP6V1G1, SERPINH1, CIRBP, CD63, NDUFA6, MDK, JUN, MYL12B, SPARC, NPC2
## GLUL, ID3, EEF1A1, VIM, CLIC1, COX6B1, LDHA, DDAH2, ENO1, CNN3
## PC_ 3
## Positive: ADGRL2, AC011288.2, RP11-420N3.3, RP11-191L9.4, NRXN3, PLPPR1, RP11-123O10.4, ZNF385D, AC114765.1, NWD2
## RBFOX3, MIR137HG, MIR325HG, SGOL1-AS1, POU6F2, ANKRD18A, LY86-AS1, LINC01197, DGCR5, DPY19L1P1
## MIR4300HG, AQP4-AS1, HPSE2, LINC00632, NLGN4X, AC067956.1, PWRN1, LINC00599, CABP1, LINC01158
## Negative: KRTCAP2, APOE, C20orf24, PDIA6, PGLS, GNG11, S100A13, HIST1H2BI, ISCA2, GSTM5
## LAPTM4A, CST3, TMEM176B, KLF4, PDLIM2, CAP1, S100A16, APRT, CYR61, FAIM
## IFITM3, CDKN1A, KLF2, CLIC1, ARPC1B, IER2, S100A1, CMTM5, FXYD1, TCN2
## PC_ 4
## Positive: RESP18, CTXN2, ATP6V1G2, GNG13, DISP2, C15orf59, CCDC85A, GNG3, SYNGR3, RGS8
## VWA5B2, C1QL3, HPCA, TUBB3, CALB1, SNCB, HTR3A, ARHGDIG, L1CAM, NAP1L5
## PCDH20, HMP19, DBNDD2, NPAS4, FABP3, CALY, FAM43B, CKMT1B, LOC728392, LTK
## Negative: PTPN18, SLCO1A2, LINC00639, INPP5D, IFI44, LYN, DISC1, NEAT1, NRGN, CMYA5
## IFI44L, GALNT15, PARP14, AC012593.1, AQP4-AS1, MSR1, MT2A, ISG15, SHROOM4, CABP1
## UACA, KCNQ1OT1, PART1, CNDP1, FAM153B, DGCR5, SOX2-OT, LINC00844, ADGRG1, LINC00599
## PC_ 5
## Positive: MEST, IGFBP2, CNN3, FBXL7, NNAT, TUBB2B, GPC3, VIM, NKAIN4, ID1
## BMP7, CSRP2, NDN, DDAH2, GPX8, IGFBPL1, MARCKSL1, GSTM3, FBLN1, PARD3
## MFAP4, PTN, FABP7, COPS6, CTNNA2, ZBTB20, BEX1, CD81, ENO1, NPAS3
## Negative: C1QB, FCGR2A, MS4A6A, TYROBP, C1QC, AIF1, C1QA, CSF1R, CD86, MRC1
## MS4A7, CTSS, CCL24, FCER1G, CD53, CD14, FCGR1A, PLEK, C3AR1, LYZ
## FCGR2B, CX3CR1, CCL3L3, CCL2, CCR1, CD68, C5AR1, PF4, HPGDS, LY86
## Running UMAP...
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 01:25:20 UMAP embedding parameters a = 0.9922 b = 1.112
## 01:25:20 Read 801 rows and found 30 numeric columns
## 01:25:20 Using Annoy for neighbor search, n_neighbors = 30
## 01:25:21 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 01:25:21 Writing NN index file to temp file /tmp/RtmpHRUwVw/file152270050fb8
## 01:25:21 Searching Annoy index using 1 thread, search_k = 3000
## 01:25:21 Annoy recall = 100%
## 01:25:21 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 01:25:22 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
## 01:25:22 Using 'irlba' for PCA
## 01:25:22 PCA: 2 components explained 52.16% variance
## 01:25:22 Scaling init to sdev = 1
## 01:25:22 Commencing optimization for 500 epochs, with 27816 positive edges
## 01:25:22 Using rng type: pcg
## 01:25:22 Optimization finished
## Running FindNeighbors...
## Computing nearest neighbor graph
## Computing SNN
## Running FindClusters...
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 801
## Number of edges: 19587
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8727
## Number of communities: 12
## Elapsed time: 0 seconds
run_tfidf() runs tf-idf on each
cluster and stores the results in the enriched_words and
tf_idf columns of the meta.data.
pseudo_seurat_tfidf <- run_tfidf(
obj = pseudo_seurat,
reduction = "UMAP",
cluster_var = "cluster",
label_var = "celltype"
)
## Extracting obsm from Seurat: umap
## + Dropping 2 conflicting obs variables: UMAP.1, UMAP.2
## Loading required namespace: tidytext
## Setting cell metadata (obs) in obj.
head(pseudo_seurat_tfidf[[]])
## cluster batch species dataset celltype label
## human.DRONC_human.ASC1 5 DRONC_human human DRONC_human ASC1 ASC1
## human.DRONC_human.ASC2 5 DRONC_human human DRONC_human ASC2 ASC2
## human.DRONC_human.END 9 DRONC_mouse mouse DRONC_mouse END END
## human.DRONC_human.exCA1 0 DRONC_human human DRONC_human exCA1 exCA1
## human.DRONC_human.exCA3 0 DRONC_human human DRONC_human exCA3 exCA3
## human.DRONC_human.exDG 0 DRONC_human human DRONC_human exDG exDG
## nCount_RNA nFeature_RNA RNA_snn_res.0.8 seurat_clusters
## human.DRONC_human.ASC1 756.6266 1693 5 5
## human.DRONC_human.ASC2 766.3392 1603 5 5
## human.DRONC_human.END 885.2824 1645 9 9
## human.DRONC_human.exCA1 714.6469 1677 0 0
## human.DRONC_human.exCA3 634.1760 1657 0 0
## human.DRONC_human.exDG 659.2845 1700 0 0
## UMAP_1 UMAP_2 enriched_words
## human.DRONC_human.ASC1 -0.4796632 0.17629431 glia; schwann; radial
## human.DRONC_human.ASC2 -0.6386602 -0.05231967 glia; schwann; radial
## human.DRONC_human.END -7.7066403 -1.84134831 vascular; peric; pericytes
## human.DRONC_human.exCA1 6.2326443 1.51104526 lpn; adpn; neuron
## human.DRONC_human.exCA3 6.0303471 1.47096417 lpn; adpn; neuron
## human.DRONC_human.exDG 5.9316036 1.49563257 lpn; adpn; neuron
## tf_idf
## human.DRONC_human.ASC1 0.198360552120631; 0.181900967132288; 0.111766521696813
## human.DRONC_human.ASC2 0.198360552120631; 0.181900967132288; 0.111766521696813
## human.DRONC_human.END 0.528096815017439; 0.042313284392222
## human.DRONC_human.exCA1 0.0527542246967963; 0.0523351433907082; 0.0428030761818744
## human.DRONC_human.exCA3 0.0527542246967963; 0.0523351433907082; 0.0428030761818744
## human.DRONC_human.exDG 0.0527542246967963; 0.0523351433907082; 0.0428030761818744
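The enriched_words and tf_idf columns shown above are semicolon-separated strings. If you want one row per cluster and word, a sketch along the following lines may help. The small data frame here is a hand-built stand-in for the metadata (values copied from the output above), and the variable names are hypothetical.

```r
# Hand-built stand-in for a few metadata rows (values copied from the output above)
meta <- data.frame(
  cluster = c("5", "5", "0"),
  enriched_words = c("glia; schwann; radial",
                     "glia; schwann; radial",
                     "lpn; adpn; neuron"),
  tf_idf = c("0.198360552120631; 0.181900967132288; 0.111766521696813",
             "0.198360552120631; 0.181900967132288; 0.111766521696813",
             "0.0527542246967963; 0.0523351433907082; 0.0428030761818744"),
  stringsAsFactors = FALSE
)

# The strings repeat for every cell in a cluster, so keep one row per cluster
per_cluster <- unique(meta[, c("cluster", "enriched_words", "tf_idf")])

# Split the "; "-separated strings into one row per (cluster, word) pair
long <- do.call(rbind, lapply(seq_len(nrow(per_cluster)), function(i) {
  data.frame(
    cluster = per_cluster$cluster[i],
    word = strsplit(per_cluster$enriched_words[i], "; ", fixed = TRUE)[[1]],
    tf_idf = as.numeric(strsplit(per_cluster$tf_idf[i], "; ", fixed = TRUE)[[1]]),
    stringsAsFactors = FALSE
  )
}))
print(long)
```

This sketch assumes each row has as many scores as words. In the output above, the END row lists three words but only two scores, so real metadata may need a length check before splitting.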
You can also plot the results in reduced-dimensional space
(e.g. UMAP). plot_tfidf() will produce a list with three
items:
- data: The processed data used to create the plot.
- tfidf_df: The full per-cluster TF-IDF enrichment results.
- plot: The ggplot.
Seurat input
res <- plot_tfidf(
obj = pseudo_seurat,
label_var = "celltype",
cluster_var = "cluster",
show_plot = TRUE
)
## Extracting obsm from Seurat: umap
## + Dropping 2 conflicting obs variables: UMAP.1, UMAP.2
## Setting cell metadata (obs) in obj.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the scNLP package.
## Please report the issue at <https://github.com/neurogenomics/scNLP/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in ggplot2::geom_point(ggplot2::aes_string(color = color_var, size =
## size_var, : Ignoring unknown aesthetics: label
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the scNLP package.
## Please report the issue at <https://github.com/neurogenomics/scNLP/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

You can color the points by other metadata attributes instead.
res <- plot_tfidf(
obj = pseudo_seurat,
label_var = "celltype",
cluster_var = "cluster",
color_var = "batch",
show_plot = TRUE
)
## Extracting obsm from Seurat: umap
## + Dropping 2 conflicting obs variables: UMAP.1, UMAP.2
## Setting cell metadata (obs) in obj.
## Warning in ggplot2::geom_point(ggplot2::aes_string(color = color_var, size =
## size_var, : Ignoring unknown aesthetics: label

SingleCellExperiment input
plot_tfidf() can also take in an object of class
SingleCellExperiment.
data("pseudo_sce")
res <- plot_tfidf(
obj = pseudo_sce,
label_var = "celltype",
cluster_var = "cluster",
show_plot = TRUE
)
list input
Lastly, if your data doesn’t fit the above example data types, you
can simply supply a named list with
metadata and embeddings.
sce_coldata <- SingleCellExperiment::colData(pseudo_sce)
data_list <- list(
metadata = sce_coldata,
embeddings = sce_coldata[, c("UMAP.1", "UMAP.2")]
)
res <- plot_tfidf(
obj = data_list,
label_var = "celltype",
cluster_var = "cluster",
show_plot = TRUE
)
You can also create an interactive version of this plot.
res <- plot_tfidf(
obj = pseudo_seurat_tfidf,
label_var = "celltype",
cluster_var = "cluster",
interact = TRUE,
show_plot = TRUE,
species = "species",
dataset = "dataset",
enriched_words = "enriched_words",
tf_idf = "tf_idf"
)
You can also show the per-cluster tf-idf results as a wordcloud.
wordcloud_res <- wordcloud_tfidf(
obj = pseudo_seurat,
label_var = "celltype",
cluster_var = "cluster",
terms_per_cluster = 10
)
## Loading required namespace: ggwordcloud
## Extracting obsm from Seurat: umap
## + Dropping 2 conflicting obs variables: UMAP.1, UMAP.2
## Setting cell metadata (obs) in obj.
## Warning in ggplot2::geom_point(ggplot2::aes_string(color = color_var, size =
## size_var, : Ignoring unknown aesthetics: label

print(wordcloud_res$tfidf_df)
## # A tibble: 147 × 8
## # Groups: cluster [15]
## cluster word n total samples tf idf tf_idf
## <fct> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 0 lpn 3 154 129 0.0195 2.71 0.0528
## 2 0 adpn 4 154 129 0.0260 2.01 0.0523
## 3 0 neuron 6 154 129 0.0390 1.10 0.0428
## 4 0 neurons 5 154 129 0.0325 1.10 0.0357
## 5 0 ex6a 2 154 129 0.0130 2.71 0.0352
## 6 0 pm1 2 154 129 0.0130 2.71 0.0352
## 7 0 proc 2 154 129 0.0130 2.71 0.0352
## 8 0 c2 1 154 129 0.00649 2.71 0.0176
## 9 0 c3 1 154 129 0.00649 2.71 0.0176
## 10 0 ca1pyr1 1 154 129 0.00649 2.71 0.0176
## # ℹ 137 more rows
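As a sanity check on how to read these columns, the first three rows can be reproduced by hand. The printed values are consistent with tf = n / total and idf = ln(number of clusters / number of clusters containing the word). In the sketch below, the number of clusters containing each word is an assumption inferred from the printed idf values, since the output does not report it directly.

```r
# Values copied from the first three rows of the printed tibble (cluster 0)
rows <- data.frame(
  word = c("lpn", "adpn", "neuron"),
  n = c(3, 4, 6),           # times the word occurs in cluster 0
  total = c(154, 154, 154), # total words in cluster 0
  stringsAsFactors = FALSE
)
n_clusters <- 15            # from "# Groups: cluster [15]"

# ASSUMPTION: number of clusters containing each word, inferred from the
# printed idf values (2.71, 2.01, 1.10); not reported in the output itself
clusters_with_word <- c(1, 2, 5)

rows$tf <- rows$n / rows$total
rows$idf <- log(n_clusters / clusters_with_word)
rows$tf_idf <- rows$tf * rows$idf
print(rows)
```

Under these assumptions the recomputed tf_idf values agree with the scores stored in the tf_idf metadata column for cluster 0 (0.0527542..., 0.0523351..., 0.0428030...).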
utils::sessionInfo()
## R Under development (unstable) (2026-01-22 r89323)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] future_1.69.0 scNLP_0.99.0 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 magrittr_2.0.4
## [4] spatstat.utils_3.2-1 farver_2.1.2 rmarkdown_2.30
## [7] fs_1.6.6 ragg_1.5.0 vctrs_0.7.1
## [10] ROCR_1.0-12 spatstat.explore_3.7-0 htmltools_0.5.9
## [13] janeaustenr_1.0.0 sass_0.4.10 sctransform_0.4.3
## [16] parallelly_1.46.1 KernSmooth_2.23-26 bslib_0.9.0
## [19] htmlwidgets_1.6.4 tokenizers_0.3.0 desc_1.4.3
## [22] ica_1.0-3 plyr_1.8.9 plotly_4.12.0
## [25] zoo_1.8-15 cachem_1.1.0 commonmark_2.0.0
## [28] igraph_2.2.1 mime_0.13 lifecycle_1.0.5
## [31] pkgconfig_2.0.3 Matrix_1.7-4 R6_2.6.1
## [34] fastmap_1.2.0 fitdistrplus_1.2-6 shiny_1.12.1
## [37] digest_0.6.39 tidytext_0.4.3 colorspace_2.1-2
## [40] patchwork_1.3.2 Seurat_5.4.0 tensor_1.5.1
## [43] RSpectra_0.16-2 irlba_2.3.5.1 SnowballC_0.7.1
## [46] textshaping_1.0.4 labeling_0.4.3 progressr_0.18.0
## [49] spatstat.sparse_3.1-0 httr_1.4.7 polyclip_1.10-7
## [52] abind_1.4-8 compiler_4.6.0 withr_3.0.2
## [55] S7_0.2.1 fastDummies_1.7.5 maps_3.4.3
## [58] MASS_7.3-65 tools_4.6.0 lmtest_0.9-40
## [61] otel_0.2.0 httpuv_1.6.16 future.apply_1.20.1
## [64] goftest_1.2-3 glue_1.8.0 nlme_3.1-168
## [67] promises_1.5.0 gridtext_0.1.5 grid_4.6.0
## [70] Rtsne_0.17 cluster_2.1.8.1 reshape2_1.4.5
## [73] generics_0.1.4 isoband_0.3.0 gtable_0.3.6
## [76] spatstat.data_3.1-9 tidyr_1.3.2 data.table_1.18.0
## [79] utf8_1.2.6 xml2_1.5.2 sp_2.2-0
## [82] spatstat.geom_3.7-0 RcppAnnoy_0.0.23 markdown_2.0
## [85] ggrepel_0.9.6 RANN_2.6.2 pillar_1.11.1
## [88] stringr_1.6.0 pals_1.10 spam_2.11-3
## [91] RcppHNSW_0.6.0 later_1.4.5 splines_4.6.0
## [94] dplyr_1.1.4 lattice_0.22-7 survival_3.8-6
## [97] deldir_2.0-4 tidyselect_1.2.1 miniUI_0.1.2
## [100] pbapply_1.7-4 knitr_1.51 gridExtra_2.3
## [103] litedown_0.9 bookdown_0.46 scattermore_1.2
## [106] xfun_0.56 matrixStats_1.5.0 stringi_1.8.7
## [109] lazyeval_0.2.2 yaml_2.3.12 evaluate_1.0.5
## [112] codetools_0.2-20 ggwordcloud_0.6.2 tibble_3.3.1
## [115] BiocManager_1.30.27 cli_3.6.5 uwot_0.2.4
## [118] xtable_1.8-4 reticulate_1.44.1 systemfonts_1.3.1
## [121] jquerylib_0.1.4 dichromat_2.0-0.1 Rcpp_1.1.1
## [124] globals_0.18.0 spatstat.random_3.4-4 mapproj_1.2.12
## [127] png_0.1-8 spatstat.univar_3.1-6 parallel_4.6.0
## [130] pkgdown_2.2.0 ggplot2_4.0.1 dotCall64_1.2
## [133] listenv_0.10.0 viridisLite_0.4.2 scales_1.4.0
## [136] ggridges_0.5.7 SeuratObject_5.3.0 purrr_1.2.1
## [139] rlang_1.1.7 cowplot_1.2.0