Import peaks

Pre-computed peaks

Some peaks files are already made available on GEO. import_peaks will import these as a named list GenomicRanges, where the names are the GEO sample ids.

grl <- PeakyFinders::import_peaks(
    ids = "GSM945244",
    searches = construct_searches(keys = c("narrowpeak",
                                           "broadpeak"))) 
## Processing id(s).
## 1 unique GEO id(s) identified.
## Querying 1 id(s) from: GEO
## Processing id: >>> GSM945244 <<<
## Determining available file types.
## Found file link(s) for 2 categories.
## narrowpeak : 
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdPkRep1.narrowPeak.gz
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdPkRep2.narrowPeak.gz
## broadpeak : 
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdHotspotsRep1.broadPeak.gz
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdHotspotsRep2.broadPeak.gz
## 
## Importing pre-computed narrowPeak files.
## Importing pre-computed broadPeak files.
## 
## Saving results ==>  /tmp/RtmpiiRuDp/file18cf2ccce9e4_PeakyFinders_grl.rds

Pre-computed peak: subset

By default, the genome-wide files are imported. However, you can also query specific subset of these peak files defined by query_granges.

Here, we’ll just construct query for all of chromosome 22 using the helper function get_genome

query_granges <- PeakyFinders::get_genome(genome = "hg19",
                                          keep.chr = 22)
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
grl2 <- PeakyFinders::import_peaks(
    ids = "GSM945244",
    searches = construct_searches(keys = c("narrowpeak",
                                           "broadpeak")),
    builds = "hg19",
    query_granges = query_granges,
    query_granges_build = "hg19") 
## Processing id(s).
## 1 unique GEO id(s) identified.
## dat is already a GRanges object.
## Querying 1 id(s) from: GEO
## Processing id: >>> GSM945244 <<<
## Determining available file types.
## Found file link(s) for 2 categories.
## narrowpeak : 
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdPkRep1.narrowPeak.gz
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdPkRep2.narrowPeak.gz
## broadpeak : 
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdHotspotsRep1.broadPeak.gz
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdHotspotsRep2.broadPeak.gz
## grlist is already in the output_build format. Skipping liftover.
## 
## Importing pre-computed narrowPeak files.
## Importing pre-computed broadPeak files.
## Subsetting peaks.
## 
## Saving results ==>  /tmp/RtmpiiRuDp/file18cfd40e56_PeakyFinders_grl.rds
knitr::kable(
    head(data.frame(grl2$GSM945244))
)

|| || || ||

Call peaks

When no pre-computed peaks are available, you can instead use import_peaks to compute peaks from bigWig or bedGraph files. The default peak calling method is currently MACS3 via the MACSr package.

Note:

By default, peaks are called using whole-chromosomes, from which a subset specified by query_granges is returned. This ensures that an appropriate background is used when computing peaks. You can also call peaks using the whole genome at once, but this is usally less computationally efficient than splitting the task by chromosome (i.e. setting split_chromosomes=TRUE when query_granges is not provided).

By default, import_peaks (and the subfunction call_peaks) automatically infer a reasonable cutoff threshold based on an iterative testing procedure. See ?PeakyFinders::call_peaks for more details.

Warning:

Peak calling with MACS3/MACSr is not currently compatible with Windows OS.

### rtracklayer::import.bw() is currently broken on Windows.
if(.Platform$OS.type!="windows"){
    
    grl3 <- PeakyFinders::import_peaks(
    ids = "GSM945244",
    searches = construct_searches(keys = "bigwig"),
    builds = "hg19",
    query_granges = query_granges,
    query_granges_build = "hg19"
    ) 
    
    knitr::kable(
        head(data.frame(grl3$GSM945244))
    )
} 
## Processing id(s).
## 1 unique GEO id(s) identified.
## dat is already a GRanges object.
## Querying 1 id(s) from: GEO
## Processing id: >>> GSM945244 <<<
## Determining available file types.
## Found file link(s) for 1 category.
## bigwig : 
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdRawRep1.bigWig
## >>> https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM945nnn/GSM945244/suppl//GSM945244_hg19_wgEncodeUwHistoneA549H3k04me3StdRawRep2.bigWig
## grlist is already in the output_build format. Skipping liftover.
## 
## Computing peaks from bigWig file.
## Importing chromosome(s): chr22
## + Importing as: BigWig
## Returning BigWig of length 242,161
## Fixing GRanges seqinfo.
## Loading required namespace: TxDb.Hsapiens.UCSC.hg19.knownGene
## Writing bigWig subset as bedGraph.
## File already in bedGraph format.
## Analyzing cutoff thresholds.
## INFO:root:Read and build bedGraph...
## INFO:root:Analyze cutoff vs number of peaks/total length of peaks/average length of peak
## INFO:root:Write report...
## INFO:root:Done
## Calling peaks.
## INFO:root:Read and build bedGraph...
## INFO:root:Call peaks from bedGraph...
## INFO:root:Write peaks...
## INFO:root:Done
## 280 peaks called.
## Importing chromosome(s): chr22
## + Importing as: BigWig
## Returning BigWig of length 268,442
## Fixing GRanges seqinfo.
## Writing bigWig subset as bedGraph.
## File already in bedGraph format.
## Analyzing cutoff thresholds.
## INFO:root:Read and build bedGraph...
## INFO:root:Analyze cutoff vs number of peaks/total length of peaks/average length of peak
## INFO:root:Write report...
## INFO:root:Done
## Calling peaks.
## INFO:root:Read and build bedGraph...
## INFO:root:Call peaks from bedGraph...
## INFO:root:Write peaks...
## INFO:root:Done
## 259 peaks called.
## Subsetting peaks.
## 
## Saving results ==>  /tmp/RtmpiiRuDp/file18cf61a73d7b_PeakyFinders_grl.rds

|| || || ||

Parallelization

To speed up the process of importing and/or calling peaks, you can take advantage of the parallelisation features in import_peaks.

For example, if you want to call peaks across the entire genome from a given bigWig file, you could speed up this process by dividing the task by chromosome (split_chromosomes=TRUE) and distributing it across multiple cores or compute nodes (nThread=2 or greater).

Internally, PeakyFinders takes a number of steps to optimise these parallel processes, so you don’t have to worry about it as much. This helps to avoid common issues stemming from excessive numbers of queries, both to the servers where the peak data is hosted, and to the C libraries that R relies on to makes these queries.

Note:

It’s usually a good idea to leave several cores free when running parallel processes, so that you can still have enough resources to use your computer while the function is running. In the example below, I’m using 10/12 cores on my local laptop.

Warning:

Parallelisation is not currently compatible with Windows OS. If Windows OS is detected, import_peaks will automatically switch to serial processing with nThread=1.

grl4 <- PeakyFinders::import_peaks(
    ids = "GSM945244",
    searches = construct_searches(keys = "bigwig"),
    split_chromosomes = TRUE,
    nThread = 10
    )  

Session Info

utils::sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4   stats    graphics utils    methods  base    
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19.masked_1.3.993
##  [2] BSgenome.Hsapiens.UCSC.hg19_1.4.3         
##  [3] BSgenome_1.65.1                           
##  [4] rtracklayer_1.57.0                        
##  [5] Biostrings_2.65.0                         
##  [6] XVector_0.37.0                            
##  [7] GenomicRanges_1.49.0                      
##  [8] GenomeInfoDb_1.33.3                       
##  [9] IRanges_2.31.0                            
## [10] S4Vectors_0.35.0                          
## [11] BiocGenerics_0.43.0                       
## [12] PeakyFinders_0.99.0                       
## [13] BiocStyle_2.25.0                          
## 
## loaded via a namespace (and not attached):
##   [1] rjson_0.2.21                           
##   [2] ellipsis_0.3.2                         
##   [3] rprojroot_2.0.3                        
##   [4] fs_1.5.2                               
##   [5] bit64_4.0.5                            
##   [6] interactiveDisplayBase_1.35.0          
##   [7] AnnotationDbi_1.59.1                   
##   [8] fansi_1.0.3                            
##   [9] xml2_1.3.3                             
##  [10] R.methodsS3_1.8.1                      
##  [11] cachem_1.0.6                           
##  [12] knitr_1.39                             
##  [13] jsonlite_1.8.0                         
##  [14] Rsamtools_2.13.3                       
##  [15] dbplyr_2.1.1                           
##  [16] png_0.1-7                              
##  [17] R.oo_1.24.0                            
##  [18] shiny_1.7.1                            
##  [19] BiocManager_1.30.18                    
##  [20] readr_2.1.2                            
##  [21] compiler_4.2.0                         
##  [22] httr_1.4.3                             
##  [23] basilisk_1.9.0                         
##  [24] assertthat_0.2.1                       
##  [25] Matrix_1.4-1                           
##  [26] fastmap_1.1.0                          
##  [27] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [28] limma_3.53.2                           
##  [29] cli_3.3.0                              
##  [30] later_1.3.0                            
##  [31] prettyunits_1.1.1                      
##  [32] htmltools_0.5.2                        
##  [33] tools_4.2.0                            
##  [34] glue_1.6.2                             
##  [35] GenomeInfoDbData_1.2.8                 
##  [36] dplyr_1.0.9                            
##  [37] grDevices_4.2.0                        
##  [38] rappdirs_0.3.3                         
##  [39] Rcpp_1.0.8.3                           
##  [40] Biobase_2.57.1                         
##  [41] jquerylib_0.1.4                        
##  [42] pkgdown_2.0.3.9000                     
##  [43] vctrs_0.4.1                            
##  [44] ExperimentHub_2.5.0                    
##  [45] xfun_0.31                              
##  [46] stringr_1.4.0                          
##  [47] mime_0.12                              
##  [48] lifecycle_1.0.1                        
##  [49] restfulr_0.0.13                        
##  [50] XML_3.99-0.9                           
##  [51] AnnotationHub_3.5.0                    
##  [52] zlibbioc_1.43.0                        
##  [53] basilisk.utils_1.9.0                   
##  [54] MatrixGenerics_1.9.0                   
##  [55] ragg_1.2.2                             
##  [56] hms_1.1.1                              
##  [57] promises_1.2.0.1                       
##  [58] SummarizedExperiment_1.27.1            
##  [59] parallel_4.2.0                         
##  [60] GEOquery_2.65.2                        
##  [61] yaml_2.3.5                             
##  [62] curl_4.3.2                             
##  [63] memoise_2.0.1                          
##  [64] reticulate_1.25                        
##  [65] sass_0.4.1                             
##  [66] datasets_4.2.0                         
##  [67] biomaRt_2.53.2                         
##  [68] stringi_1.7.6                          
##  [69] RSQLite_2.2.14                         
##  [70] BiocVersion_3.16.0                     
##  [71] BiocIO_1.7.1                           
##  [72] desc_1.4.1                             
##  [73] GenomicFeatures_1.49.4                 
##  [74] filelock_1.0.2                         
##  [75] BiocParallel_1.31.6                    
##  [76] rlang_1.0.2                            
##  [77] pkgconfig_2.0.3                        
##  [78] systemfonts_1.0.4                      
##  [79] bitops_1.0-7                           
##  [80] matrixStats_0.62.0                     
##  [81] evaluate_0.15                          
##  [82] lattice_0.20-45                        
##  [83] purrr_0.3.4                            
##  [84] GenomicAlignments_1.33.0               
##  [85] bit_4.0.4                              
##  [86] tidyselect_1.1.2                       
##  [87] here_1.0.1                             
##  [88] magrittr_2.0.3                         
##  [89] bookdown_0.26                          
##  [90] R6_2.5.1                               
##  [91] generics_0.1.2                         
##  [92] DelayedArray_0.23.0                    
##  [93] DBI_1.1.2                              
##  [94] MACSr_1.5.0                            
##  [95] pillar_1.7.0                           
##  [96] KEGGREST_1.37.0                        
##  [97] RCurl_1.98-1.6                         
##  [98] tibble_3.1.7                           
##  [99] dir.expiry_1.5.0                       
## [100] crayon_1.5.1                           
## [101] utf8_1.2.2                             
## [102] BiocFileCache_2.5.0                    
## [103] tzdb_0.3.0                             
## [104] rmarkdown_2.14                         
## [105] progress_1.2.2                         
## [106] grid_4.2.0                             
## [107] data.table_1.14.2                      
## [108] blob_1.2.3                             
## [109] digest_0.6.29                          
## [110] xtable_1.8-4                           
## [111] tidyr_1.2.0                            
## [112] httpuv_1.6.5                           
## [113] regioneR_1.29.0                        
## [114] R.utils_2.11.0                         
## [115] textshaping_0.3.6                      
## [116] bslib_0.3.1