Authors: Hiranyamaya (Hiru) Dash, Thomas Roberts, Nathan Skene
Updated: Nov-04-2024
Introduction
MotifPeeker is used to compare and analyse datasets from epigenomic profiling methods, with motif enrichment as the key benchmark. The package outputs an HTML report consisting of three sections:
General Metrics: Provides an overview of metrics related to dataset peaks, including FRiP scores, peak widths, and motif-to-summit distances.
Known Motif Enrichment Analysis: Presents statistics on the frequency of enriched user-supplied motifs in the datasets and compares them between the common and unique peaks from comparison and reference datasets.
De-Novo Motif Enrichment Analysis: Details the statistics of de-novo discovered motifs in common and unique peaks from comparison and reference datasets. Examines motif similarities and identifies the closest known motifs in JASPAR or the provided database.
Installation
MotifPeeker uses memes, which relies on a local install of the MEME suite. The suite can be installed as follows:
MEME_VERSION=5.5.5 # or the latest version
wget https://meme-suite.org/meme/meme-software/$MEME_VERSION/meme-$MEME_VERSION.tar.gz
tar zxf meme-$MEME_VERSION.tar.gz
cd meme-$MEME_VERSION
./configure --prefix=$HOME/meme --with-url=http://meme-suite.org/ \
--enable-build-libxml2 --enable-build-libxslt
make
make install
# Add to PATH
# Double quotes expand MEME_VERSION now, so new shells do not need it to be set
echo "export PATH=\$HOME/meme/bin:\$HOME/meme/libexec/meme-$MEME_VERSION:\$PATH" >> ~/.bashrc
echo 'export MEME_BIN=$HOME/meme/bin' >> ~/.bashrc
source ~/.bashrc
NOTE: It is important that the Perl dependencies associated with the MEME suite are also installed, particularly XML::Parser, which can be installed using the following command in the terminal:
cpan XML::Parser
For more information, refer to the Perl dependency section of the MEME suite.
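Before moving on, it can help to confirm from within R that the MEME binaries are visible. The following is a minimal sketch that relies only on base R and on the PATH and MEME_BIN settings shown above; the helper name meme_looks_installed() is hypothetical and not part of MotifPeeker or memes.

```r
# Hypothetical helper (not part of MotifPeeker or memes): returns TRUE when a
# `meme` executable is found in MEME_BIN or anywhere on the PATH.
meme_looks_installed <- function(meme_bin = Sys.getenv("MEME_BIN")) {
    # MEME_BIN is the directory exported in ~/.bashrc during installation
    in_meme_bin <- nzchar(meme_bin) && file.exists(file.path(meme_bin, "meme"))
    # Sys.which() returns an empty string when the executable is not on PATH
    on_path <- nzchar(Sys.which("meme"))
    in_meme_bin || on_path
}

if (meme_looks_installed()) {
    message("MEME suite found.")
} else {
    message("MEME suite not found; check PATH and MEME_BIN.")
}
```

R sessions that are not started from a shell, such as those launched by some IDEs, may not inherit variables set in ~/.bashrc. The memes package also provides check_meme_install() for a more thorough check.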
Once the MEME suite and its associated Perl dependencies are installed, the development version of MotifPeeker can be installed using the following code:
if(!require("remotes")) install.packages("remotes")
remotes::install_github("neurogenomics/MotifPeeker")
library(MotifPeeker)
Alternatively, you can use the Docker/Singularity container to run the package out-of-the-box.
Usage
Load the package and example datasets.
library(MotifPeeker)
data("CTCF_ChIP_peaks", package = "MotifPeeker")
data("CTCF_TIP_peaks", package = "MotifPeeker")
data("motif_MA1102.3", package = "MotifPeeker")
data("motif_MA1930.2", package = "MotifPeeker")
Prepare input files.
peak_files <- list(CTCF_ChIP_peaks, CTCF_TIP_peaks)
alignment_files <- list(
    system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker"),
    system.file("extdata", "CTCF_TIP_alignment.bam", package = "MotifPeeker")
)
motif_files <- list(motif_MA1102.3, motif_MA1930.2)
Run MotifPeeker():
MotifPeeker(
    peak_files = peak_files,
    reference_index = 2,  # Set TIP-seq experiment as reference
    alignment_files = alignment_files,
    exp_labels = c("ChIP", "TIP"),
    exp_type = c("chipseq", "tipseq"),
    genome_build = "hg38",
    motif_files = motif_files,
    cell_counts = NULL,  # No cell-count information
    denovo_motif_discovery = TRUE,
    denovo_motifs = 3,
    motif_db = NULL,
    download_buttons = TRUE,
    out_dir = tempdir(),
    workers = 2,
    debug = FALSE,
    quiet = FALSE,
    verbose = TRUE
)
Required Inputs
These input parameters must be provided:
- peak_files: A list of paths to peak files or GRanges objects with the peaks to analyse. Currently, only peak files from MACS2/3 (.narrowPeak) and SEACR (.bed) are supported. ENCODE file IDs can also be provided to automatically fetch peak file(s) from the ENCODE database.
- reference_index: An integer specifying the index of the dataset in the peak_files list to use as the reference for various comparisons. (default = 1)
- genome_build: A character string or a BSgenome object specifying the genome build of the datasets. At the moment, only hg38 and hg19 are supported as abbreviated input.
- out_dir: A character string specifying the output directory to save the HTML report and other files.
Optional Inputs
These input parameters are optional, but recommended, as they add more analyses or enhance existing ones:
- alignment_files: A list of paths to alignment files or Rsamtools::BamFile objects with the alignment sequences to analyse. Alignment files are used to calculate read-related metrics like FRiP score. ENCODE file IDs can also be provided to automatically fetch alignment file(s) from the ENCODE database.
- exp_labels: A character vector of labels for each peak file. If not provided, capital letters will be used as labels in the report.
- exp_type: A character vector of experimental types for each peak file. Useful for comparison of different methods. If not provided, all datasets will be classified as “unknown” experiment types in the report. exp_type is used only for labelling. It does not affect the analyses. You can also input custom strings. Datasets will be grouped as long as they match their respective exp_type. Supported experimental types are:
    - chipseq: ChIP-seq data
    - tipseq: TIP-seq data
    - cuttag: CUT&Tag data
    - cutrun: CUT&Run data
- motif_files: A character vector of paths to motif files, or a vector of universalmotif-class objects. Required to run Known Motif Enrichment Analysis. JASPAR matrix IDs can also be provided to automatically fetch motifs from JASPAR.
- motif_labels: A character vector of labels for each motif file. Only used if file paths are passed in motif_files. If not provided, the motif file names will be used as labels.
- cell_counts: An integer vector of experiment cell counts for each peak file (if available). Creates additional comparisons based on cell counts.
- motif_db: Path to a .meme format file to use as reference database, or a list of universalmotif-class objects. Results from de-novo motif discovery are searched against this database to find similar motifs. If not provided, JASPAR CORE database will be used, making this parameter truly optional. NOTE: p-value estimates are inaccurate when the database has fewer than 50 entries.
Other Options
For more information on additional parameters, please refer to the documentation for MotifPeeker().
Runtime Guidance
For 4 datasets, the runtime is approximately 3 minutes with denovo_motif_discovery disabled. However, de-novo motif discovery can take hours to complete.
To make computation faster, we highly recommend tuning the following arguments:
- workers: Running motif discovery in parallel can significantly reduce runtime, but it is very memory-intensive, consuming upwards of 10GB of RAM per thread. Memory starvation can greatly slow the process, so set workers with caution.
- denovo_motifs: The number of motifs to discover per sequence group exponentially increases runtime. We recommend no more than 5 motifs to make a meaningful inference.
- trim_seq_width: Trimming sequences before running de-novo motif discovery can significantly reduce the search space. Sequence length can exponentially increase runtime. We recommend running the script with denovo_motif_discovery = FALSE and studying the motif-summit distance distribution under general metrics to find the sequence length that captures most motifs. A good starting point is 150 but it can be reduced further if appropriate.
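The memory guidance for workers can be turned into a rough rule of thumb. The sketch below assumes a budget of about 10 GB per worker, as quoted above. The helper suggest_workers() and its arguments are hypothetical, and the amount of available RAM has to be supplied by hand because base R has no portable way to query it.

```r
# Hypothetical helper (not part of MotifPeeker): caps the number of workers by
# both the number of cores and a per-worker memory budget.
suggest_workers <- function(available_ram_gb,
                            n_cores = parallel::detectCores(),
                            ram_per_worker_gb = 10) {
    # detectCores() can return NA on some platforms; fall back to one core
    if (is.na(n_cores)) n_cores <- 1L
    by_memory <- floor(available_ram_gb / ram_per_worker_gb)
    # Always allow at least one worker, even on small machines
    as.integer(max(1, min(n_cores, by_memory)))
}

suggest_workers(available_ram_gb = 32, n_cores = 8)  # 3
suggest_workers(available_ram_gb = 8, n_cores = 4)   # 1
```

The result is only a starting point for the workers argument; actual memory use depends on the data and on the other de-novo discovery settings.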
Outputs
MotifPeeker generates its output in a new folder within the out_dir directory. The folder is named MotifPeeker_YYYYMMDD_HHMMSS and contains the following files:
- MotifPeeker.html: The main HTML report, including all analyses and plots.
- Output from various MEME suite tools in their respective sub-directories, if save_runfiles is set to TRUE.
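Because each run writes to its own timestamped folder, repeated runs in the same out_dir accumulate. As a minimal sketch using base R only, the most recent report can be located from the folder naming scheme described above; find_latest_report() is a hypothetical helper, not a MotifPeeker function.

```r
# Hypothetical helper (not part of MotifPeeker): returns the path to the
# MotifPeeker.html of the newest run folder in out_dir, or NULL if none exists.
find_latest_report <- function(out_dir) {
    run_dirs <- list.files(
        out_dir,
        pattern = "^MotifPeeker_[0-9]{8}_[0-9]{6}$",
        full.names = TRUE
    )
    # Keep directories only, in case a file happens to match the pattern
    run_dirs <- run_dirs[dir.exists(run_dirs)]
    if (length(run_dirs) == 0) {
        return(NULL)
    }
    # YYYYMMDD_HHMMSS sorts chronologically when sorted as text
    latest <- run_dirs[order(basename(run_dirs), decreasing = TRUE)][1]
    file.path(latest, "MotifPeeker.html")
}

report <- find_latest_report(tempdir())
if (!is.null(report)) {
    message("Latest report: ", report)
}
```

The returned path can then be passed to utils::browseURL() to open the report in a browser.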
Datasets
MotifPeeker comes with several datasets bundled:
- CTCF_TIP_peaks: Human CTCF peak file generated with TIP-seq using HCT116 cell-line. No control files were used to generate the peak file. The peaks were called using MACS3 with CTCF_TIP_alignment.bam as input.
- CTCF_ChIP_peaks: Human CTCF peak file generated with ChIP-seq using HCT116 cell-line. No control files were used to generate the peak file. The peaks were called using MACS3 with CTCF_ChIP_alignment.bam as input.
- motif_MA1102.3: The JASPAR motif for CTCFL (MA1102.3) for Homo sapiens. Sourced from JASPAR.
- motif_MA1930.2: The JASPAR motif for CTCF (MA1930.2) for Homo sapiens. Sourced from JASPAR.
- CTCF_TIP_alignment.bam: Human CTCF alignment file generated with TIP-seq using HCT116 cell-line. The alignment file was generated using the nf-core/cutandrun pipeline. Raw read files were sourced from the NIH Sequence Read Archive (ID: SRR16963166). Only available as extdata.
- CTCF_ChIP_alignment.bam: Human CTCF alignment file generated with ChIP-seq using HCT116 cell-line. Sourced from ENCODE (Accession: ENCFF091ODJ). Only available as extdata.
Please note that the peaks and alignments included are a very small subset (chr10:65,654,529-74,841,155) of the actual data. They serve only as an example to demonstrate the package and to run the tests that maintain its integrity.
Licensing Restrictions
MotifPeeker incorporates the MEME Suite, which is available free of charge for educational, research, and non-profit purposes. Users intending to use MotifPeeker for commercial purposes are required to purchase a license for the MEME Suite.
For more details, please refer to the MEME Suite Copyright Page.
Contact
Neurogenomics Lab
UK Dementia Research Institute
Department of Brain Sciences
Faculty of Medicine
Imperial College London
GitHub
Session Info
utils::sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rmarkdown_2.28
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_1.8.9 renv_1.0.11
## [4] dplyr_1.1.4 compiler_4.4.1 BiocManager_1.30.25
## [7] tidyselect_1.2.1 rvcheck_0.2.1 scales_1.3.0
## [10] yaml_2.3.10 fastmap_1.2.0 here_1.0.1
## [13] ggplot2_3.5.1 R6_2.5.1 generics_0.1.3
## [16] knitr_1.48 yulab.utils_0.1.7 tibble_3.2.1
## [19] desc_1.4.3 dlstats_0.1.7 rprojroot_2.0.4
## [22] munsell_0.5.1 pillar_1.9.0 RColorBrewer_1.1-3
## [25] rlang_1.1.4 utf8_1.2.4 badger_0.2.4
## [28] xfun_0.49 fs_1.6.5 cli_3.6.3
## [31] magrittr_2.0.3 rworkflows_1.0.2 digest_0.6.37
## [34] grid_4.4.1 lifecycle_1.0.4 vctrs_0.6.5
## [37] evaluate_1.0.1 glue_1.8.0 data.table_1.16.2
## [40] fansi_1.0.6 colorspace_2.1-1 tools_4.4.1
## [43] pkgconfig_2.0.3 htmltools_0.5.8.1