Benchmark epigenomic profiling methods using motif enrichment
Source:R/MotifPeeker.R
MotifPeeker.Rd
This function compares different epigenomic datasets using motif enrichment as the key metric. The output is an easy-to-interpret HTML document with the results. The report contains three main sections: (1) General Metrics on peak and alignment files (if provided), (2) Known Motif Enrichment Analysis and (3) De-novo Motif Enrichment Analysis.
Usage
MotifPeeker(
peak_files,
reference_index = 1,
alignment_files = NULL,
exp_labels = NULL,
exp_type = NULL,
genome_build,
motif_files = NULL,
motif_labels = NULL,
cell_counts = NULL,
denovo_motif_discovery = TRUE,
denovo_motifs = 3,
filter_n = 6,
trim_seq_width = NULL,
motif_db = NULL,
download_buttons = TRUE,
meme_path = NULL,
out_dir = tempdir(),
save_runfiles = FALSE,
display = if (interactive()) "browser",
workers = 2,
quiet = TRUE,
debug = FALSE,
verbose = FALSE
)
Arguments
- peak_files
A character vector of path to peak files, or a vector of GRanges objects generated using
read_peak_file
. Currently, peak files from the following peak-calling tools are supported:MACS2:
.narrowPeak
filesSEACR:
.bed
files
ENCODE file IDs can also be provided to automatically fetch peak file(s) from the ENCODE database.
- reference_index
An integer specifying the index of the peak file to use as the reference dataset for comparison. Indexing starts from 1. (default = 1)
- alignment_files
A character vector of path to alignment files, or a vector of
BamFile
objects. (optional) Alignment files are used to calculate read-related metrics like FRiP score. ENCODE file IDs can also be provided to automatically fetch alignment file(s) from the ENCODE database.- exp_labels
A character vector of labels for each peak file. (optional) If not provided, capital letters will be used as labels in the report.
- exp_type
A character vector of experimental types for each peak file. (optional) Useful for comparison of different methods. If not provided, all datasets will be classified as "unknown" experiment types in the report. Supported experimental types are:
chipseq
: ChIP-seq datatipseq
: TIP-seq datacuttag
: CUT&Tag datacutrun
: CUT&Run data
exp_type
is used only for labelling. It does not affect the analysis. You can also input custom strings. Datasets will be grouped as long as they match their respectiveexp_type
.- genome_build
A character string with the abbreviated genome build name, or a BSGenome object. At the moment, only
hg38
andhg19
are supported as abbreviated input.- motif_files
A character vector of path to motif files, or a vector of
universalmotif-class
objects. (optional) Required to run Known Motif Enrichment Analysis. JASPAR matrix IDs can also be provided to automatically fetch motifs from the JASPAR.- motif_labels
A character vector of labels for each motif file. (optional) Only used if path to file names are passed in
motif_files
. If not provided, the motif file names will be used as labels.- cell_counts
An integer vector of experiment cell counts for each peak file. (optional) Creates additional comparisons based on cell counts.
- denovo_motif_discovery
A logical indicating whether to perform de-novo motif discovery for the third section of the report. (default = TRUE)
- denovo_motifs
An integer specifying the number of de-novo motifs to discover. (default = 3) Note that higher values take longer to compute.
- filter_n
An integer specifying the number of consecutive nucleotide repeats a de-novo discovered motif must contain to be filtered out. (default = 6)
- trim_seq_width
An integer specifying the width of the sequence to extract around the summit (default = NULL). This sequence is used to search for de novo motifs. If not provided, the entire peak region will be used. This parameter is intended to reduce the search space and speed up motif discovery; therefore, a value less than the average peak width is recommended. Peaks are trimmed symmetrically around the summit while respecting the peak bounds.
- motif_db
Path to
.meme
format file to use as reference database, or a list ofuniversalmotif-class
objects. (optional) Results from de-novo motif discovery are searched against this database to find similar motifs. If not provided, JASPAR CORE database will be used. NOTE: p-value estimates are inaccurate when the database has fewer than 50 entries.A logical indicating whether to include download buttons for various files within the HTML report. (default = TRUE)
- meme_path
path to
meme/bin/
(optional). Defaut:NULL
, searches "MEME_PATH" environment variable or "meme_path" option for path to "meme/bin/".- out_dir
A character string specifying the directory to save the output files. (default =
tempdir()
) A sub-directory with the output files will be created in this directory.- save_runfiles
A logical indicating whether to save intermediate files generated during the run, such as those from FIMO and AME. (default = FALSE)
- display
A character vector specifying the display mode for the HTML report once it is generated. (default = NULL) Options are:
"browser"
: Open the report in the default web browser."rstudio"
: Open the report in the RStudio Viewer.NULL
: Do not open the report.
- workers
An integer specifying the number of threads to use for parallel processing. (default = 1)
IMPORTANT: For each worker, please ensure a minimum of 6GB of memory (RAM) is available asdenovo_motif_discovery
is memory-intensive.- quiet
A logical indicating whether to print markdown knit messages. (default = FALSE)
- debug
A logical indicating whether to print debug/error messages in the HTML report. (default = FALSE)
- verbose
A logical indicating whether to print verbose messages while running the function. (default = FALSE)
Details
Runtime guidance: For 4 datasets, the runtime is approximately 3 minutes with denovo_motif_discovery disabled. However, de-novo motif discovery can take hours to complete. To make computation faster, we highly recommend tuning the following arguments:
workers
Running motif discovery in parallel can significantly reduce runtime, but it is very memory-intensive, consuming upwards of 10GB of RAM per thread. Memory starvation can greatly slow the process, so set
workers
with caution.denovo_motifs
The number of motifs to discover per sequence group exponentially increases runtime. We recommend no more than 5 motifs to make a meaningful inference.
trim_seq_width
Trimming sequences before running de-novo motif discovery can significantly reduce the search space. Sequence length can exponentially increase runtime. We recommend running the script with
denovo_motif_discovery = FALSE
and studying the motif-summit distance distribution under general metrics to find the sequence length that captures most motifs. A good starting point is 150 but it can be reduced further if appropriate.
Note
Running de-novo motif discovery is computationally expensive and can
require from minutes to hours. denovo_motifs
can widely affect the
runtime (higher values take longer). Setting trim_seq_width
to a lower
value can also reduce the runtime significantly.
Examples
peaks <- list(
system.file("extdata", "CTCF_ChIP_peaks.narrowPeak",
package = "MotifPeeker"),
system.file("extdata", "CTCF_TIP_peaks.narrowPeak",
package = "MotifPeeker")
)
alignments <- list(
system.file("extdata", "CTCF_ChIP_alignment.bam",
package = "MotifPeeker"),
system.file("extdata", "CTCF_TIP_alignment.bam",
package = "MotifPeeker")
)
motifs <- list(
system.file("extdata", "motif_MA1930.2.jaspar",
package = "MotifPeeker"),
system.file("extdata", "motif_MA1102.3.jaspar",
package = "MotifPeeker")
)
# \donttest{
if (memes::meme_is_installed()) {
# MotifPeeker takes time to run
MotifPeeker(
peak_files = peaks,
reference_index = 1,
alignment_files = alignments,
exp_labels = c("ChIP", "TIP"),
exp_type = c("chipseq", "tipseq"),
genome_build = "hg38",
motif_files = motifs,
motif_labels = NULL,
cell_counts = NULL,
denovo_motif_discovery = TRUE,
denovo_motifs = 1,
motif_db = NULL,
download_buttons = TRUE,
out_dir = tempdir(),
workers = 1,
debug = FALSE,
quiet = TRUE,
verbose = FALSE
)
}
#> [1] "/tmp/Rtmp7e65j3/MotifPeeker_20241104_165523"
# }