vignettes/peakyfinders.Rmd
peakyfinders.Rmd
Some peaks files are already made available on GEO.
import_peaks
will import these as a named list
GenomicRanges
, where the names are the GEO sample
ids
.
grl <- PeakyFinders::import_peaks(
ids = "GSM945244",
searches = construct_searches(keys = c("narrowpeak",
"broadpeak")))
## Processing id(s).
## 1 unique GEO id(s) identified.
## Querying 1 id(s) from: GEO
## Processing id: >>> GSM945244 <<<
## Determining available file types.
## Found file link(s) for 2 categories.
## narrowpeak :
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdPkRep1.narrowPeak.gz
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdPkRep2.narrowPeak.gz
## broadpeak :
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdHotspotsRep1.broadPeak.gz
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdHotspotsRep2.broadPeak.gz
##
## Importing pre-computed narrowPeak files.
## Importing pre-computed broadPeak files.
##
## Saving results ==> /tmp/RtmpiiRuDp/file18cf2ccce9e4_PeakyFinders_grl.rds
By default, the genome-wide files are imported. However, you can also
query specific subset of these peak files defined by
query_granges
.
Here, we’ll just construct query for all of chromosome 22 using the
helper function get_genome
query_granges <- PeakyFinders::get_genome(genome = "hg19",
keep.chr = 22)
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
grl2 <- PeakyFinders::import_peaks(
ids = "GSM945244",
searches = construct_searches(keys = c("narrowpeak",
"broadpeak")),
builds = "hg19",
query_granges = query_granges,
query_granges_build = "hg19")
## Processing id(s).
## 1 unique GEO id(s) identified.
## dat is already a GRanges object.
## Querying 1 id(s) from: GEO
## Processing id: >>> GSM945244 <<<
## Determining available file types.
## Found file link(s) for 2 categories.
## narrowpeak :
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdPkRep1.narrowPeak.gz
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdPkRep2.narrowPeak.gz
## broadpeak :
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdHotspotsRep1.broadPeak.gz
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdHotspotsRep2.broadPeak.gz
## grlist is already in the output_build format. Skipping liftover.
##
## Importing pre-computed narrowPeak files.
## Importing pre-computed broadPeak files.
## Subsetting peaks.
##
## Saving results ==> /tmp/RtmpiiRuDp/file18cfd40e56_PeakyFinders_grl.rds
knitr::kable(
head(data.frame(grl2$GSM945244))
)
|| || || ||
When no pre-computed peaks are available, you can instead use
import_peaks
to compute peaks from bigWig or
bedGraph files. The default peak calling method is currently MACS3
via
the MACSr
package.
Note:
By default, peaks are called using whole-chromosomes, from which a subset specified by
query_granges
is returned. This ensures that an appropriate background is used when computing peaks. You can also call peaks using the whole genome at once, but this is usally less computationally efficient than splitting the task by chromosome (i.e. settingsplit_chromosomes=TRUE
whenquery_granges
is not provided).
By default,
import_peaks
(and the subfunctioncall_peaks
) automatically infer a reasonable cutoff threshold based on an iterative testing procedure. See?PeakyFinders::call_peaks
for more details.
Warning:
Peak calling with MACS3
/MACSr
is not
currently compatible with Windows OS.
### rtracklayer::import.bw() is currently broken on Windows.
if(.Platform$OS.type!="windows"){
grl3 <- PeakyFinders::import_peaks(
ids = "GSM945244",
searches = construct_searches(keys = "bigwig"),
builds = "hg19",
query_granges = query_granges,
query_granges_build = "hg19"
)
knitr::kable(
head(data.frame(grl3$GSM945244))
)
}
## Processing id(s).
## 1 unique GEO id(s) identified.
## dat is already a GRanges object.
## Querying 1 id(s) from: GEO
## Processing id: >>> GSM945244 <<<
## Determining available file types.
## Found file link(s) for 1 category.
## bigwig :
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdRawRep1.bigWig
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdRawRep2.bigWig
## grlist is already in the output_build format. Skipping liftover.
##
## Computing peaks from bigWig file.
## Importing chromosome(s): chr22
## + Importing as: BigWig
## Returning BigWig of length 242,161
## Fixing GRanges seqinfo.
## Loading required namespace: TxDb.Hsapiens.UCSC.hg19.knownGene
## Writing bigWig subset as bedGraph.
## File already in bedGraph format.
## Analyzing cutoff thresholds.
## INFO:root:Read and build bedGraph...
## INFO:root:Analyze cutoff vs number of peaks/total length of peaks/average length of peak
## INFO:root:Write report...
## INFO:root:Done
## Calling peaks.
## INFO:root:Read and build bedGraph...
## INFO:root:Call peaks from bedGraph...
## INFO:root:Write peaks...
## INFO:root:Done
## 280 peaks called.
## Importing chromosome(s): chr22
## + Importing as: BigWig
## Returning BigWig of length 268,442
## Fixing GRanges seqinfo.
## Writing bigWig subset as bedGraph.
## File already in bedGraph format.
## Analyzing cutoff thresholds.
## INFO:root:Read and build bedGraph...
## INFO:root:Analyze cutoff vs number of peaks/total length of peaks/average length of peak
## INFO:root:Write report...
## INFO:root:Done
## Calling peaks.
## INFO:root:Read and build bedGraph...
## INFO:root:Call peaks from bedGraph...
## INFO:root:Write peaks...
## INFO:root:Done
## 259 peaks called.
## Subsetting peaks.
##
## Saving results ==> /tmp/RtmpiiRuDp/file18cf61a73d7b_PeakyFinders_grl.rds
|| || || ||
To speed up the process of importing and/or calling peaks, you can
take advantage of the parallelisation features in
import_peaks
.
For example, if you want to call peaks across the entire genome from
a given bigWig file, you could speed up this process by
dividing the task by chromosome (split_chromosomes=TRUE
)
and distributing it across multiple cores or compute nodes
(nThread=2
or greater).
Internally, PeakyFinders
takes a number of steps to
optimise these parallel processes, so you don’t have to worry about it
as much. This helps to avoid common issues stemming from excessive
numbers of queries, both to the servers where the peak data is hosted,
and to the C libraries that R relies on to makes these queries.
Note:
It’s usually a good idea to leave several cores free when running parallel processes, so that you can still have enough resources to use your computer while the function is running. In the example below, I’m using 10/12 cores on my local laptop.
Warning:
Parallelisation is not currently compatible with Windows OS. If
Windows OS is detected, import_peaks will automatically switch to serial
processing with nThread=1
.
grl4 <- PeakyFinders::import_peaks(
ids = "GSM945244",
searches = construct_searches(keys = "bigwig"),
split_chromosomes = TRUE,
nThread = 10
)
utils::sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics utils methods base
##
## other attached packages:
## [1] BSgenome.Hsapiens.UCSC.hg19.masked_1.3.993
## [2] BSgenome.Hsapiens.UCSC.hg19_1.4.3
## [3] BSgenome_1.65.1
## [4] rtracklayer_1.57.0
## [5] Biostrings_2.65.0
## [6] XVector_0.37.0
## [7] GenomicRanges_1.49.0
## [8] GenomeInfoDb_1.33.3
## [9] IRanges_2.31.0
## [10] S4Vectors_0.35.0
## [11] BiocGenerics_0.43.0
## [12] PeakyFinders_0.99.0
## [13] BiocStyle_2.25.0
##
## loaded via a namespace (and not attached):
## [1] rjson_0.2.21
## [2] ellipsis_0.3.2
## [3] rprojroot_2.0.3
## [4] fs_1.5.2
## [5] bit64_4.0.5
## [6] interactiveDisplayBase_1.35.0
## [7] AnnotationDbi_1.59.1
## [8] fansi_1.0.3
## [9] xml2_1.3.3
## [10] R.methodsS3_1.8.1
## [11] cachem_1.0.6
## [12] knitr_1.39
## [13] jsonlite_1.8.0
## [14] Rsamtools_2.13.3
## [15] dbplyr_2.1.1
## [16] png_0.1-7
## [17] R.oo_1.24.0
## [18] shiny_1.7.1
## [19] BiocManager_1.30.18
## [20] readr_2.1.2
## [21] compiler_4.2.0
## [22] httr_1.4.3
## [23] basilisk_1.9.0
## [24] assertthat_0.2.1
## [25] Matrix_1.4-1
## [26] fastmap_1.1.0
## [27] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [28] limma_3.53.2
## [29] cli_3.3.0
## [30] later_1.3.0
## [31] prettyunits_1.1.1
## [32] htmltools_0.5.2
## [33] tools_4.2.0
## [34] glue_1.6.2
## [35] GenomeInfoDbData_1.2.8
## [36] dplyr_1.0.9
## [37] grDevices_4.2.0
## [38] rappdirs_0.3.3
## [39] Rcpp_1.0.8.3
## [40] Biobase_2.57.1
## [41] jquerylib_0.1.4
## [42] pkgdown_2.0.3.9000
## [43] vctrs_0.4.1
## [44] ExperimentHub_2.5.0
## [45] xfun_0.31
## [46] stringr_1.4.0
## [47] mime_0.12
## [48] lifecycle_1.0.1
## [49] restfulr_0.0.13
## [50] XML_3.99-0.9
## [51] AnnotationHub_3.5.0
## [52] zlibbioc_1.43.0
## [53] basilisk.utils_1.9.0
## [54] MatrixGenerics_1.9.0
## [55] ragg_1.2.2
## [56] hms_1.1.1
## [57] promises_1.2.0.1
## [58] SummarizedExperiment_1.27.1
## [59] parallel_4.2.0
## [60] GEOquery_2.65.2
## [61] yaml_2.3.5
## [62] curl_4.3.2
## [63] memoise_2.0.1
## [64] reticulate_1.25
## [65] sass_0.4.1
## [66] datasets_4.2.0
## [67] biomaRt_2.53.2
## [68] stringi_1.7.6
## [69] RSQLite_2.2.14
## [70] BiocVersion_3.16.0
## [71] BiocIO_1.7.1
## [72] desc_1.4.1
## [73] GenomicFeatures_1.49.4
## [74] filelock_1.0.2
## [75] BiocParallel_1.31.6
## [76] rlang_1.0.2
## [77] pkgconfig_2.0.3
## [78] systemfonts_1.0.4
## [79] bitops_1.0-7
## [80] matrixStats_0.62.0
## [81] evaluate_0.15
## [82] lattice_0.20-45
## [83] purrr_0.3.4
## [84] GenomicAlignments_1.33.0
## [85] bit_4.0.4
## [86] tidyselect_1.1.2
## [87] here_1.0.1
## [88] magrittr_2.0.3
## [89] bookdown_0.26
## [90] R6_2.5.1
## [91] generics_0.1.2
## [92] DelayedArray_0.23.0
## [93] DBI_1.1.2
## [94] MACSr_1.5.0
## [95] pillar_1.7.0
## [96] KEGGREST_1.37.0
## [97] RCurl_1.98-1.6
## [98] tibble_3.1.7
## [99] dir.expiry_1.5.0
## [100] crayon_1.5.1
## [101] utf8_1.2.2
## [102] BiocFileCache_2.5.0
## [103] tzdb_0.3.0
## [104] rmarkdown_2.14
## [105] progress_1.2.2
## [106] grid_4.2.0
## [107] data.table_1.14.2
## [108] blob_1.2.3
## [109] digest_0.6.29
## [110] xtable_1.8-4
## [111] tidyr_1.2.0
## [112] httpuv_1.6.5
## [113] regioneR_1.29.0
## [114] R.utils_2.11.0
## [115] textshaping_0.3.6
## [116] bslib_0.3.1