EpiCompare is an R package for comparing multiple epigenomic datasets for quality control and benchmarking purposes. The function outputs a report in HTML format consisting of three sections:
Note: Peaks located in blacklisted regions and non-standard chromosomes are removed from the files prior to analysis.
Installing all Imports and Suggests will allow you to use the full functionality of EpiCompare right away, without having to stop and install extra dependencies later on.
To install these packages as well, use:
Note that this will increase installation time, but it means that you won't have to install any further R packages later when using functions that rely on suggested dependencies.
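A minimal sketch of the full installation, assuming the standard Bioconductor route via `BiocManager`. The wrapper name `install_epicompare_full` is hypothetical and only there so that nothing is installed until you call it; `dependencies = TRUE` is forwarded to `install.packages()`, which is what pulls in the Suggests as well.

```r
# Hypothetical helper (not part of EpiCompare): installs EpiCompare together
# with its Imports and Suggests. Nothing is installed until it is called.
install_epicompare_full <- function() {
  # BiocManager is the standard installer for Bioconductor packages.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # dependencies = TRUE also installs suggested packages.
  BiocManager::install("EpiCompare", dependencies = TRUE)
}
```

Run `install_epicompare_full()` once in an interactive session with network access.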
If you use EpiCompare, please cite:
EpiCompare: R package for the comparison and quality control of epigenomic peak files (2022) Sera Choi, Brian M. Schilder, Leyla Abbasova, Alan E. Murphy, Nathan G. Skene, bioRxiv, 2022.07.22.501149; doi: https://doi.org/10.1101/2022.07.22.501149
The documentation in this README and the GitHub Pages website pertains to the development version of EpiCompare. Older versions of EpiCompare may have slightly different documentation (e.g. available functions, parameters). For documentation of older versions of EpiCompare, please see the Documentation section of the relevant version on Bioconductor.
Load package and example datasets.
Prepare input files:
EpiCompare::gather_files is helpful for identifying and importing peak or picard files.
    # To import BED files as GRanges objects
    peakfiles <- EpiCompare::gather_files(dir = "path/to/peaks/",
                                          type = "peaks.stringent")

    # EpiCompare alternatively accepts paths (to BED files) as input
    peakfiles <- list(sample1 = "/path/to/peaks/file1_peaks.stringent.bed",
                      sample2 = "/path/to/peaks/file2_peaks.stringent.bed")

    # To import Picard summary output txt files as data frames
    picard_files <- EpiCompare::gather_files(dir = "path/to/peaks",
                                             type = "picard")
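When passing paths rather than GRanges objects, the named list can also be assembled with base R. The sketch below is one possible approach, not something the package requires: it creates two empty placeholder files in a temporary directory (so that it runs anywhere) and derives each sample name from the file name. The directory name and the `_peaks.stringent.bed` suffix are assumptions taken from the example paths above.

```r
# Temporary directory with two empty placeholder BED files (illustration only).
peak_dir <- file.path(tempdir(), "peaks_demo")
dir.create(peak_dir, showWarnings = FALSE)
file.create(file.path(peak_dir, c("file1_peaks.stringent.bed",
                                  "file2_peaks.stringent.bed")))

# Collect the full paths of all files ending in "peaks.stringent.bed".
bed_paths <- list.files(peak_dir,
                        pattern = "peaks\\.stringent\\.bed$",
                        full.names = TRUE)

# Name each path after its sample by stripping the shared suffix.
sample_names <- sub("_peaks\\.stringent\\.bed$", "", basename(bed_paths))
peakfiles <- as.list(setNames(bed_paths, sample_names))

names(peakfiles)
```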
These input parameters must be provided:
peakfiles: Peakfiles you want to analyse. EpiCompare accepts peakfiles as GRanges object and/or as paths to BED files. Files must be listed and named using
genome_build: A named list indicating the human genome build used to generate each of the following inputs:
peakfiles: Genome build for the peakfiles input. Assumes genome build is the same for each element in the peakfiles list.
reference: Genome build for the reference input.
blacklist: Genome build for the blacklist input.
genome_build = list(peakfiles="hg38", reference="hg19", blacklist="hg19")
genome_build_output: Genome build to standardise all inputs to. Liftovers will be performed automatically as needed. Default is "hg19".
blacklist: Peakfile as GRanges object specifying genomic regions that have anomalous and/or unstructured signals independent of the cell-line or experiment. For the human hg19 and hg38 genomes, use the built-in data data(hg19_blacklist) and data(hg38_blacklist), respectively. For the mouse mm10 genome, use the built-in data data(mm10_blacklist).
output_dir: Path to the directory where all EpiCompare outputs will be saved.
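The required parameters above can be put together roughly as follows. This is a sketch under assumptions rather than a canonical call: it assumes the built-in example peak files and `hg19_blacklist` are on the hg19 build, writes to a temporary directory, and only runs the report if EpiCompare is installed.

```r
# Genome build of each input (assumed to be hg19 for the built-in example data).
genome_build <- list(peakfiles = "hg19", reference = "hg19", blacklist = "hg19")

# Temporary output directory for this sketch; use a permanent path in practice.
output_dir <- file.path(tempdir(), "EpiCompare_example")
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

# Only attempt the run when the package is available.
if (requireNamespace("EpiCompare", quietly = TRUE)) {
  # Load two built-in peak files and the hg19 blacklist.
  data("CnT_H3K27ac", "CnR_H3K27ac", "hg19_blacklist", package = "EpiCompare")

  # Peak files must be supplied as a named list.
  peakfiles <- list(CnT = CnT_H3K27ac, CnR = CnR_H3K27ac)

  EpiCompare::EpiCompare(peakfiles = peakfiles,
                         genome_build = genome_build,
                         genome_build_output = "hg19",
                         blacklist = hg19_blacklist,
                         output_dir = output_dir)
}
```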
The following input files are optional:
picard_files: A list of summary metrics output from Picard. Picard MarkDuplicates can be used to identify duplicate reads in the alignment. This tool generates a summary output, normally with the ending .markdup.MarkDuplicates.metrics.txt. If this input is provided, metrics on fragments (e.g. mapped fragments and duplication rate) will be included in the report. Files must be in data.frame format, listed using list() and named using names(). To import Picard duplication metrics (.txt file) into R as a data frame, use picard <- read.table("/path/to/picard/output", header = TRUE, fill = TRUE).
reference: Reference peak file(s) used in stat_plot and chromHMM_plot. The file must be a GRanges object, listed and named using list("reference_name" = GRanges_object). If more than one reference is specified, EpiCompare outputs an individual report for each reference. However, please note that this can take a while.
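To illustrate the `picard_files` format, the sketch below writes a toy metrics file and reads it back with the `read.table()` call quoted above. The file contents are made up for illustration and contain only a few of the columns a real MarkDuplicates report has; the sample name `sample1` is likewise a placeholder.

```r
# Write a toy Picard-style metrics file (values are invented for illustration).
picard_path <- tempfile(fileext = ".MarkDuplicates.metrics.txt")
writeLines(c(
  "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
  "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION",
  "toy_lib\t1000\t100\t0.1"
), picard_path)

# Lines starting with "#" are skipped because read.table() treats them as comments.
picard <- read.table(picard_path, header = TRUE, fill = TRUE)

# Each data frame goes into a list whose names identify the samples.
picard_files <- list(picard)
names(picard_files) <- "sample1"

str(picard_files)
```

The names used here should correspond to the sample names used in `peakfiles` so that the metrics can be matched to the right sample; that correspondence is an assumption of this sketch.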
By default, these plots will not be included in the report unless set to TRUE. To turn on all features at once, simply use the
upset_plot: Upset plot of overlapping peaks between samples.
stat_plot: Included only if a reference dataset is provided. The plot shows statistical significance (p/q-values) of sample peaks that are overlapping/non-overlapping with the reference.
chromHMM_plot: ChromHMM annotation of peaks. If a reference dataset is provided, ChromHMM annotation of overlapping and non-overlapping peaks with the reference is also included in the report.
chipseeker_plot: ChIPseeker annotation of peaks.
enrichment_plot: KEGG pathway and GO enrichment analysis of peaks.
tss_plot: Peak frequency around (+/- 3000bp) the transcriptional start site. Note that it may take a while to generate this plot for large sample sizes.
precision_recall_plot: Plot showing the precision-recall score across the peak calling stringency thresholds.
corr_plot: Plot showing the correlation between the quantiles when the genome is binned at a set size. These quantiles are based on the intensity of the peak, dependent on the peak caller used (q-value for MACS2).
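One way to keep track of which optional plots are switched on is to hold the flags in a named list, as sketched below. The flag names are the parameters listed above; the choice of which two to enable is arbitrary, and the idea of splicing the list into the call with `do.call()` is a suggestion rather than a documented workflow.

```r
# All optional plot parameters described above, initially switched off.
plot_flags <- c("upset_plot", "stat_plot", "chromHMM_plot", "chipseeker_plot",
                "enrichment_plot", "tss_plot", "precision_recall_plot",
                "corr_plot")
optional_plots <- setNames(as.list(rep(FALSE, length(plot_flags))), plot_flags)

# Switch on only the plots wanted for this run (arbitrary example selection).
optional_plots[c("upset_plot", "tss_plot")] <- TRUE

# Names of the plots that are currently enabled.
names(Filter(isTRUE, optional_plots))
```

The list could then be combined with the required arguments, for example `do.call(EpiCompare::EpiCompare, c(required_args, optional_plots))`, where `required_args` is a hypothetical named list holding `peakfiles`, `genome_build`, `blacklist` and `output_dir`.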
chromHMM_annotation: Cell-line annotation for ChromHMM. Default is K562. Options are:
interact: By default, all heatmaps (percentage overlap and ChromHMM heatmaps) in the report will be interactive. If set to FALSE, all heatmaps will be static. N.B. If interact=TRUE, interactive heatmaps will be saved as html files, which may take time for larger sample sizes.
output_filename: By default, the report is named EpiCompare.html. You can specify the file name of the report here.
output_timestamp: By default FALSE. If TRUE, the filename of the report includes the date.
EpiCompare outputs the following:
If save_output=TRUE, all plots generated by EpiCompare will be saved in the EpiCompare_file directory, also in the specified output_dir.
An example report comparing ATAC-seq and DNase-seq can be found here
EpiCompare includes several built-in datasets:
encode_H3K27ac: Human H3K27ac peak file generated with ChIP-seq using K562 cell-line. Taken from ENCODE project. For more information, run
CnT_H3K27ac: Human H3K27ac peak file generated with CUT&Tag using K562 cell-line from Kaya-Okur et al., (2019). For more information, run
CnR_H3K27ac: Human H3K27ac peak file generated with CUT&Run using K562 cell-line from Meers et al., (2019). For more details, run
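The built-in datasets can be loaded with `data()`, as in the sketch below. It only does anything if EpiCompare is installed, and it assumes the three objects are GRanges peak files, so that `length()` gives the number of peaks.

```r
# Names of the built-in peak files listed above.
dataset_names <- c("encode_H3K27ac", "CnT_H3K27ac", "CnR_H3K27ac")

if (requireNamespace("EpiCompare", quietly = TRUE)) {
  # Load all three datasets into the current environment.
  data(list = dataset_names, package = "EpiCompare")

  # Report the number of peaks in each dataset.
  for (nm in dataset_names) {
    cat(nm, ":", length(get(nm)), "peaks\n")
  }

  # Help page for one of the datasets.
  help("encode_H3K27ac", package = "EpiCompare")
}
```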
    ## R Under development (unstable) (2023-11-14 r85524)
    ## Platform: x86_64-pc-linux-gnu
    ## Running under: Ubuntu 22.04.3 LTS
    ##
    ## Matrix products: default
    ## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
    ## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
    ##
    ## locale:
    ##  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
    ##  LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
    ##  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
    ##  LC_PAPER=en_US.UTF-8 LC_NAME=C
    ##  LC_ADDRESS=C LC_TELEPHONE=C
    ##  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    ##
    ## time zone: UTC
    ## tzcode source: system (glibc)
    ##
    ## attached base packages:
    ##  stats graphics grDevices utils datasets methods base
    ##
    ## other attached packages:
    ##  rmarkdown_2.25
    ##
    ## loaded via a namespace (and not attached):
    ##  gtable_0.3.4 jsonlite_1.8.7 renv_1.0.3
    ##  dplyr_1.1.4 compiler_4.4.0 BiocManager_1.30.22
    ##  tidyselect_1.2.0 rvcheck_0.2.1 scales_1.2.1
    ##  yaml_2.3.7 fastmap_1.1.1 here_1.0.1
    ##  ggplot2_3.4.4 R6_2.5.1 generics_0.1.3
    ##  knitr_1.45 yulab.utils_0.1.0 tibble_3.2.1
    ##  desc_1.4.2 dlstats_0.1.7 rprojroot_2.0.4
    ##  munsell_0.5.0 pillar_1.9.0 RColorBrewer_1.1-3
    ##  rlang_1.1.2 utf8_1.2.4 cachem_1.0.8
    ##  badger_0.2.3 xfun_0.41 fs_1.6.3
    ##  memoise_2.0.1 cli_3.6.1 magrittr_2.0.3
    ##  rworkflows_1.0.0 digest_0.6.33 grid_4.4.0
    ##  lifecycle_1.0.4 vctrs_0.6.4 data.table_1.14.8
    ##  evaluate_0.23 glue_1.6.2 fansi_1.0.5
    ##  colorspace_2.1-0 tools_4.4.0 pkgconfig_2.0.3
    ##  htmltools_0.5.7