Introduction

scNLP is an R package for applying Natural Language Processing (NLP) techniques to single-cell omics data. The primary use case is harmonizing non-standardized cell-type labels across datasets.

When integrating single-cell datasets from multiple sources, cell-type annotations are often inconsistent. Different labs may use different naming conventions for the same cell types (e.g., “Purkinje neurons” vs “Purkinje cells” vs “PCs”). scNLP provides tools to identify semantically similar labels and find enriched terms that characterize each cluster.

Key Features

  • TF-IDF Analysis: Identify enriched terms in cell-type labels using Term Frequency-Inverse Document Frequency
  • Visualization: Generate scatter plots and word clouds showing enriched terms per cluster
  • Neighbor Search: Find cells with similar labels using graph-based methods
  • Seurat Integration: Seamless integration with the Seurat ecosystem

Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("scNLP")

Quick Start

Example Data

scNLP includes a pseudo-bulk Seurat object containing mean expression per cell-type from 11 different datasets.

data("pseudo_seurat")
pseudo_seurat
## An object of class Seurat 
## 2000 features across 801 samples within 1 assay 
## Active assay: RNA (2000 features, 0 variable features)
##  2 layers present: counts, data
##  2 dimensional reductions calculated: mnn, umap

The metadata contains cell-type labels, batch information, and cluster assignments:

head(pseudo_seurat[[]][, c("celltype", "batch", "cluster")])
##                         celltype       batch cluster
## human.DRONC_human.ASC1      ASC1 DRONC_human       5
## human.DRONC_human.ASC2      ASC2 DRONC_human       5
## human.DRONC_human.END        END DRONC_human       9
## human.DRONC_human.exCA1    exCA1 DRONC_human       0
## human.DRONC_human.exCA3    exCA3 DRONC_human       0
## human.DRONC_human.exDG      exDG DRONC_human       0

Core Workflow: TF-IDF Analysis

The core functionality of scNLP centers on TF-IDF (Term Frequency-Inverse Document Frequency) analysis. This NLP technique identifies words or phrases enriched in one “document” (cluster) relative to others.

Understanding TF-IDF

In the context of single-cell data:

  • Terms: Words within cell-type labels (e.g., “neuron”, “astrocyte”, “inhibitory”)
  • Documents: Clusters of cells
  • Goal: Find terms that best characterize each cluster

This is analogous to finding marker genes, but for text labels instead of gene expression.

Running TF-IDF

The run_tfidf() function computes TF-IDF scores and adds results to the Seurat object metadata:

result <- run_tfidf(
    obj = pseudo_seurat,
    reduction = "umap",
    cluster_var = "cluster",
    label_var = "celltype"
)
## Extracting obsm from Seurat: umap
## + Dropping 2 conflicting obs variables: UMAP.1, UMAP.2
## Loading required namespace: tidytext
## Setting cell metadata (obs) in obj.
# View enriched words per cluster
head(result[[]][, c("cluster", "celltype", "enriched_words", "tf_idf")])
##                         cluster celltype             enriched_words
## human.DRONC_human.ASC1        5     ASC1      glia; schwann; radial
## human.DRONC_human.ASC2        5     ASC2      glia; schwann; radial
## human.DRONC_human.END         9      END vascular; peric; pericytes
## human.DRONC_human.exCA1       0    exCA1          lpn; adpn; neuron
## human.DRONC_human.exCA3       0    exCA3          lpn; adpn; neuron
## human.DRONC_human.exDG        0     exDG          lpn; adpn; neuron
##                                                                             tf_idf
## human.DRONC_human.ASC1     0.198360552120631; 0.181900967132288; 0.111766521696813
## human.DRONC_human.ASC2     0.198360552120631; 0.181900967132288; 0.111766521696813
## human.DRONC_human.END                         0.528096815017439; 0.042313284392222
## human.DRONC_human.exCA1 0.0527542246967963; 0.0523351433907082; 0.0428030761818744
## human.DRONC_human.exCA3 0.0527542246967963; 0.0523351433907082; 0.0428030761818744
## human.DRONC_human.exDG  0.0527542246967963; 0.0523351433907082; 0.0428030761818744

Visualizing Results

Scatter Plot

plot_tfidf() creates a dimensional reduction plot with enriched terms labeled at cluster centers:

res <- plot_tfidf(
    obj = pseudo_seurat,
    label_var = "celltype",
    cluster_var = "cluster",
    show_plot = TRUE
)
## Extracting obsm from Seurat: umap
## + Dropping 2 conflicting obs variables: UMAP.1, UMAP.2
## Setting cell metadata (obs) in obj.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
##  Please use tidy evaluation idioms with `aes()`.
##  See also `vignette("ggplot2-in-packages")` for more information.
##  The deprecated feature was likely used in the scNLP package.
##   Please report the issue at <https://github.com/neurogenomics/scNLP/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in ggplot2::geom_point(ggplot2::aes_string(color = color_var, size =
## size_var, : Ignoring unknown aesthetics: label
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
##  Please use `linewidth` instead.
##  The deprecated feature was likely used in the scNLP package.
##   Please report the issue at <https://github.com/neurogenomics/scNLP/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The returned object contains:

  • data: Processed data used for plotting
  • tfidf_df: Full per-cluster TF-IDF results
  • plot: The ggplot object

Word Cloud

For an alternative visualization, wordcloud_tfidf() displays enriched terms as a word cloud:

wc_res <- wordcloud_tfidf(
    obj = pseudo_seurat,
    label_var = "celltype",
    cluster_var = "cluster",
    terms_per_cluster = 5
)
## Loading required namespace: ggwordcloud
## Extracting obsm from Seurat: umap
## + Dropping 2 conflicting obs variables: UMAP.1, UMAP.2
## Setting cell metadata (obs) in obj.
## Warning in ggplot2::geom_point(ggplot2::aes_string(color = color_var, size =
## size_var, : Ignoring unknown aesthetics: label

Additional Functions

Preprocessing Pipeline

If starting from raw counts, seurat_pipeline() provides a standard preprocessing workflow:

# Create Seurat object from counts
counts <- Seurat::GetAssayData(pseudo_seurat, layer = "counts")
obj <- SeuratObject::CreateSeuratObject(counts = counts)

# Run standard pipeline
processed <- seurat_pipeline(obj)

search_neighbors() finds cells with similar labels using the nearest-neighbor graph:

neighbors <- search_neighbors(
    seurat = pseudo_seurat,
    var1_search = "purkinje",
    max_neighbors = 5
)
## No variable features detected. Computing
## No PCA detected. Computing
## Centering and scaling data matrix
## PC_ 1 
## Positive:  FTMT, ACTG1, ALAS2, HSPA1L, NME1-NME2, POTEI, OTOP2, RAPSN, BEST2, DPEP2 
##     HMGB2, NAPRT1, KRT8, PPIB, DSCC1, POU4F3, CCDC102A, GAPDHS, CHST6, AGXT2 
##     LYZL2, MTMR8, ACTG2, ACPL2, BANF1, PPAPDC2, HTR1A, IFI30, CYBRD1, LHX8 
## Negative:  FAIM2, CAMK2A, SCN1A, CAMK2B, FRRS1L, UNC80, PHYHIP, RASGRF2, CCK, GRIA2 
##     STXBP5L, ARPP21, SLC12A5, DIRAS2, RYR2, SLC4A10, KCNT1, GRM5, CAMKV, KIAA1211L 
##     GABRA4, GABRA1, SV2B, CX3CL1, AK5, PNMA2, JPH4, DGKG, GPR158, KCNC2 
## PC_ 2 
## Positive:  CAMKK1, DGKQ, NT5DC3, CA7, ABCG4, HTR1A, C5orf28, OTOP2, HYKK, DPEP2 
##     CHST6, POTEI, SLC8A3, SLC38A11, ADRA2A, MPPED1, MTMR8, HTR7, CACNA1B, PPAPDC2 
##     C2orf69, GRIK1, IFI30, STK32B, RASL10B, SLC24A4, FAXDC2, ADCY3, ACSS2, ANKRD29 
## Negative:  RAN, HSP90AA1, H2AFZ, HNRNPAB, CCT5, NPM1, GNG5, DBI, HMGB2, ITM2B 
##     ATP6V1G1, SERPINH1, CIRBP, CD63, NDUFA6, MDK, JUN, MYL12B, SPARC, NPC2 
##     GLUL, ID3, EEF1A1, VIM, CLIC1, COX6B1, LDHA, DDAH2, ENO1, CNN3 
## PC_ 3 
## Positive:  ADGRL2, AC011288.2, RP11-420N3.3, RP11-191L9.4, NRXN3, PLPPR1, RP11-123O10.4, ZNF385D, AC114765.1, NWD2 
##     RBFOX3, MIR137HG, MIR325HG, SGOL1-AS1, POU6F2, ANKRD18A, LY86-AS1, LINC01197, DGCR5, DPY19L1P1 
##     MIR4300HG, AQP4-AS1, HPSE2, LINC00632, NLGN4X, AC067956.1, PWRN1, LINC00599, CABP1, LINC01158 
## Negative:  KRTCAP2, APOE, C20orf24, PDIA6, PGLS, GNG11, S100A13, HIST1H2BI, ISCA2, GSTM5 
##     LAPTM4A, CST3, TMEM176B, KLF4, PDLIM2, CAP1, S100A16, APRT, CYR61, FAIM 
##     IFITM3, CDKN1A, KLF2, CLIC1, ARPC1B, IER2, S100A1, CMTM5, FXYD1, TCN2 
## PC_ 4 
## Positive:  RESP18, CTXN2, ATP6V1G2, GNG13, DISP2, C15orf59, CCDC85A, GNG3, SYNGR3, RGS8 
##     VWA5B2, C1QL3, HPCA, TUBB3, CALB1, SNCB, HTR3A, ARHGDIG, L1CAM, NAP1L5 
##     PCDH20, HMP19, DBNDD2, NPAS4, FABP3, CALY, FAM43B, CKMT1B, LOC728392, LTK 
## Negative:  PTPN18, SLCO1A2, LINC00639, INPP5D, IFI44, LYN, DISC1, NEAT1, NRGN, CMYA5 
##     IFI44L, GALNT15, PARP14, AC012593.1, AQP4-AS1, MSR1, MT2A, ISG15, SHROOM4, CABP1 
##     UACA, KCNQ1OT1, PART1, CNDP1, FAM153B, DGCR5, SOX2-OT, LINC00844, ADGRG1, LINC00599 
## PC_ 5 
## Positive:  MEST, IGFBP2, CNN3, FBXL7, NNAT, TUBB2B, GPC3, VIM, NKAIN4, ID1 
##     BMP7, CSRP2, NDN, DDAH2, GPX8, IGFBPL1, MARCKSL1, GSTM3, FBLN1, PARD3 
##     MFAP4, PTN, FABP7, COPS6, CTNNA2, ZBTB20, BEX1, CD81, ENO1, NPAS3 
## Negative:  C1QB, FCGR2A, MS4A6A, TYROBP, C1QC, AIF1, C1QA, CSF1R, CD86, MRC1 
##     MS4A7, CTSS, CCL24, FCER1G, CD53, CD14, FCGR1A, PLEK, C3AR1, LYZ 
##     FCGR2B, CX3CR1, CCL3L3, CCL2, CCR1, CD68, C5AR1, PF4, HPGDS, LY86
## No graphs detected. Computing.
## Computing nearest neighbor graph
## Computing SNN
## Using graph: RNA_snn
## + Filtering results by `var1_search`: purkinje
## + 3 entries matching `var1_search` identified.
## + Adding original names to results
## + Returning 19 pair-wise similarities.
head(neighbors)
##                                            Var1
##                                          <char>
## 1: human.descartes_SampledData.Purkinje_neurons
## 2: human.descartes_SampledData.Purkinje_neurons
## 3: human.descartes_SampledData.Purkinje_neurons
## 4: human.descartes_SampledData.Purkinje_neurons
## 5:           zebrafish.Raj2020.purkinje_neurons
## 6:           zebrafish.Raj2020.purkinje_neurons
##                                                      Var2 similarity
##                                                    <char>      <num>
## 1: human.descartes_SampledData.SATB2_LRRC7_positive_cells  0.9047619
## 2:           human.descartes_SampledData.Oligodendrocytes  0.8181818
## 3:            human.descartes_SampledData.Granule_neurons  0.8181818
## 4:                human.descartes_SampledData.ENS_neurons  0.8181818
## 5:    zebrafish.Raj2020.neurons._gabaergic._glutamatergic  0.8181818
## 6:      zebrafish.Raj2020.purkinje_neurons_.gabaergic_...  0.8181818
##                                         Var1_id
##                                          <char>
## 1: human.descartes_SampledData.Purkinje_neurons
## 2: human.descartes_SampledData.Purkinje_neurons
## 3: human.descartes_SampledData.Purkinje_neurons
## 4: human.descartes_SampledData.Purkinje_neurons
## 5:           zebrafish.Raj2020.purkinje_neurons
## 6:           zebrafish.Raj2020.purkinje_neurons
##                                                   Var2_id
##                                                    <char>
## 1: human.descartes_SampledData.SATB2_LRRC7_positive_cells
## 2:           human.descartes_SampledData.Oligodendrocytes
## 3:            human.descartes_SampledData.Granule_neurons
## 4:                human.descartes_SampledData.ENS_neurons
## 5:    zebrafish.Raj2020.neurons._gabaergic._glutamatergic
## 6:      zebrafish.Raj2020.purkinje_neurons_.gabaergic_...

Variable Feature Selection

FindVariableFeatures_split() identifies variable features within data subsets, useful for batch-aware feature selection:

var_features <- FindVariableFeatures_split(
    seurat = pseudo_seurat,
    split.by = "batch",
    nfeatures = 100,
    nfeatures_max = 50
)

Working with Different Input Types

scNLP functions accept multiple input types: ## Seurat Objects

res <- plot_tfidf(
    obj = pseudo_seurat,
    label_var = "celltype",
    cluster_var = "cluster"
)

SingleCellExperiment Objects

data("pseudo_sce")
res <- plot_tfidf(
    obj = pseudo_sce,
    label_var = "celltype",
    cluster_var = "cluster"
)

Named Lists

For custom data structures, provide a list with metadata and embeddings:

data_list <- list(
    metadata = pseudo_seurat[[]],
    embeddings = Seurat::Embeddings(pseudo_seurat, "umap")
)
res <- plot_tfidf(
    obj = data_list,
    label_var = "celltype",
    cluster_var = "cluster"
)

Further Reading

For more detailed examples, see the additional vignettes:

Session Info

utils::sessionInfo()
## R Under development (unstable) (2026-01-22 r89323)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] future_1.69.0    scNLP_0.99.0     BiocStyle_2.39.0
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3     jsonlite_2.0.0         magrittr_2.0.4        
##   [4] spatstat.utils_3.2-1   farver_2.1.2           rmarkdown_2.30        
##   [7] fs_1.6.6               ragg_1.5.0             vctrs_0.7.1           
##  [10] ROCR_1.0-12            spatstat.explore_3.7-0 htmltools_0.5.9       
##  [13] janeaustenr_1.0.0      sass_0.4.10            sctransform_0.4.3     
##  [16] parallelly_1.46.1      KernSmooth_2.23-26     bslib_0.9.0           
##  [19] htmlwidgets_1.6.4      tokenizers_0.3.0       desc_1.4.3            
##  [22] ica_1.0-3              plyr_1.8.9             plotly_4.12.0         
##  [25] zoo_1.8-15             cachem_1.1.0           commonmark_2.0.0      
##  [28] igraph_2.2.1           mime_0.13              lifecycle_1.0.5       
##  [31] pkgconfig_2.0.3        Matrix_1.7-4           R6_2.6.1              
##  [34] fastmap_1.2.0          fitdistrplus_1.2-6     shiny_1.12.1          
##  [37] digest_0.6.39          tidytext_0.4.3         colorspace_2.1-2      
##  [40] patchwork_1.3.2        Seurat_5.4.0           tensor_1.5.1          
##  [43] RSpectra_0.16-2        irlba_2.3.5.1          SnowballC_0.7.1       
##  [46] textshaping_1.0.4      labeling_0.4.3         progressr_0.18.0      
##  [49] spatstat.sparse_3.1-0  httr_1.4.7             polyclip_1.10-7       
##  [52] abind_1.4-8            compiler_4.6.0         withr_3.0.2           
##  [55] S7_0.2.1               fastDummies_1.7.5      maps_3.4.3            
##  [58] MASS_7.3-65            tools_4.6.0            lmtest_0.9-40         
##  [61] otel_0.2.0             httpuv_1.6.16          future.apply_1.20.1   
##  [64] goftest_1.2-3          glue_1.8.0             nlme_3.1-168          
##  [67] promises_1.5.0         gridtext_0.1.5         grid_4.6.0            
##  [70] Rtsne_0.17             cluster_2.1.8.1        reshape2_1.4.5        
##  [73] generics_0.1.4         isoband_0.3.0          gtable_0.3.6          
##  [76] spatstat.data_3.1-9    tidyr_1.3.2            data.table_1.18.0     
##  [79] xml2_1.5.2             sp_2.2-0               spatstat.geom_3.7-0   
##  [82] RcppAnnoy_0.0.23       markdown_2.0           ggrepel_0.9.6         
##  [85] RANN_2.6.2             pillar_1.11.1          stringr_1.6.0         
##  [88] pals_1.10              spam_2.11-3            RcppHNSW_0.6.0        
##  [91] later_1.4.5            splines_4.6.0          dplyr_1.1.4           
##  [94] lattice_0.22-7         survival_3.8-6         deldir_2.0-4          
##  [97] tidyselect_1.2.1       miniUI_0.1.2           pbapply_1.7-4         
## [100] knitr_1.51             gridExtra_2.3          litedown_0.9          
## [103] bookdown_0.46          scattermore_1.2        xfun_0.56             
## [106] matrixStats_1.5.0      stringi_1.8.7          lazyeval_0.2.2        
## [109] yaml_2.3.12            evaluate_1.0.5         codetools_0.2-20      
## [112] ggwordcloud_0.6.2      tibble_3.3.1           BiocManager_1.30.27   
## [115] cli_3.6.5              uwot_0.2.4             xtable_1.8-4          
## [118] reticulate_1.44.1      systemfonts_1.3.1      jquerylib_0.1.4       
## [121] dichromat_2.0-0.1      Rcpp_1.1.1             globals_0.18.0        
## [124] spatstat.random_3.4-4  mapproj_1.2.12         png_0.1-8             
## [127] spatstat.univar_3.1-6  parallel_4.6.0         pkgdown_2.2.0         
## [130] ggplot2_4.0.1          dotCall64_1.2          listenv_0.10.0        
## [133] viridisLite_0.4.2      scales_1.4.0           ggridges_0.5.7        
## [136] SeuratObject_5.3.0     purrr_1.2.1            rlang_1.1.7           
## [139] cowplot_1.2.0