Compute correlation matrix — compute

Compute correlation matrix on all peak files.

compute_corr(
  peakfiles,
  reference = NULL,
  genome_build,
  keep_chr = NULL,
  drop_empty_chr = FALSE,
  bin_size = 5000,
  method = "spearman",
  intensity_cols = c("total_signal", "qValue", "Peak Score", "score"),
  return_bins = FALSE,
  fill_diag = NA,
  workers = check_workers(),
  save_path = tempfile(fileext = ".corr.csv.gz")
)

Arguments

peakfiles

A list of peak files as GRanges object and/or as paths to BED files. If paths are provided, EpiCompare imports the file as GRanges object. EpiCompare also accepts a list containing a mix of GRanges objects and paths.Files must be listed and named using list(). E.g. list("name1"=file1, "name2"=file2). If no names are specified, default file names will be assigned.

reference

A named list containing reference peak file(s) as GRanges object. Please ensure that the reference file is listed and named i.e. list("reference_name" = reference_peak). If more than one reference is specified, individual reports for each reference will be generated. However, please note that specifying more than one reference can take awhile. If a reference is specified, it enables two analyses: (1) plot showing statistical significance of overlapping/non-overlapping peaks; and (2) ChromHMM of overlapping/non-overlapping peaks.

genome_build

The build of **all** peak and reference files to calculate the correlation matrix on. If all peak and reference files are not of the same build use liftover_grlist to convert them all before running. Genome build should be one of hg19, hg38, mm9, mm10.

keep_chr

Which chromosomes to keep.

drop_empty_chr

Drop chromosomes that are not present in any of the peakfiles (default: FALSE).

bin_size

Default of 100. Base-pair size of the bins created to measure correlation. Use smaller value for higher resolution but longer run time and larger memory usage.

method

Default spearman (i.e. non-parametric). A character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson", "kendall", or "spearman": can be abbreviated.

intensity_cols

Depending on which columns are present, this value will be used to get quantiles and ultimately calculate the correlations:

"total_signal" : Used by the peak calling software SEACR. NOTE: Another SEACR column (e.g. "max_signal") can be used together or instead of "total_signal".
"qValue"Used by the peak calling software MACS2/3. Should contain the negative log of the p-values after multiple testing correction.
"Peak Score" : Used by the peak calling software HOMER.

return_bins

If TRUE, returns a named list with both the rebinned (standardised) peaks ("bin") and the correlation matrix ("cor"). If FALSE (default), returns only the correlation matrix (unlisted).

fill_diag

Fill the diagonal of the overlap matrix.

workers

Number of threads to parallelize across.

save_path

Path to save a table of correlation results to.

Value

correlation matrix

Examples

data("CnR_H3K27ac")
data("CnT_H3K27ac")
data("encode_H3K27ac")
peakfiles <- list(CnR_H3K27ac=CnR_H3K27ac, CnT_H3K27ac=CnT_H3K27ac)
reference <- list("encode_H3K27ac"=encode_H3K27ac)

#increasing bin_size for speed but lower values will give more granular corr
corr_mat <- compute_corr(peakfiles = peakfiles,
                         reference = reference,
                         genome_build = "hg19",
                         bin_size = 200000, 
                         workers = 1)
#> Standardising peak files in 16,318 bins of 2e+05 bp.
#> Merging data into matrix.
#> Binned matrix size: 16,318 x 3
#> Matrix sparsity: 0.9646
#> Calculating correlation matrix.
#> Done computing correlations in 1 seconds.
#> Saving correlation results ==> /tmp/RtmpNAg45x/file243d676d7842.corr.csv.gz