Benchmarking Di4 against Di3, bedtools, and BEDOPS

Benchmark package

We prepared a package to benchmark Di4 against Di3, bedtools, and BEDOPS. The package can be downloaded from this link. Its contents are as follows:

  • files\; this folder contains 2000 ENCODE narrowPeak files downloaded from the ENCODE repository.


  • list; the narrowPeak files in the files folder are grouped into 9 datasets labeled A1, A2, A3, A4, B1, B2, C1, C2, and C3. The narrowPeak files in each dataset are listed in one of 9 text files under the list folder; each text file is named after the dataset it describes.


  • copy.py; this Python script takes a dataset label (e.g., A1) and copies all the narrowPeak files belonging to that dataset (as listed in the text files under the list folder) from the files folder to a new folder named after the dataset label. The script can be executed as follows:

    python copy.py a1
    

    This script can be downloaded individually from this link.


  • run.py; this Python script runs bedtools and BEDOPS. The syntax to run this script is as follows:

    python run.py TOOL_NAME DATASET [--on-the-fly]
    

    where TOOL_NAME is either --bedtools or --bedops, and DATASET is a dataset label (e.g., a1). When the --on-the-fly flag is set, the script measures runtime including both preprocessing (sorting the data) and processing time; when the flag is omitted, it measures processing time only. This script can be downloaded individually from this link.


  • ref.narrowpeak; this file is used as a reference for running bedtools and BEDOPS intersect functions.


  • README.pdf; this file contains a thorough explanation of how to benchmark Di4 against Di3, bedtools, and BEDOPS using the provided datasets and scripts. The README file can be downloaded individually from this link.
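The copy step described for copy.py can be pictured with a minimal sketch. This is not the packaged script: the function name copy_dataset is hypothetical, and the sketch assumes each list file holds one narrowPeak file name per line, which the package description does not state explicitly.

```python
import shutil
from pathlib import Path


def copy_dataset(list_file, src_dir, dest_dir):
    """Copy every file named in list_file from src_dir into dest_dir.

    Assumption: list_file contains one file name per line.
    Returns the list of names that were copied.
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for line in Path(list_file).read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        shutil.copy2(src_dir / name, dest_dir / name)
        copied.append(name)
    return copied
```

Taking the list file path as an argument, rather than deriving it from the dataset label, avoids guessing how the list files are named on disk.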

Run Giggle

We prepared a Python script to run Giggle. The syntax to run this script is as follows:

python giggle.py PATH_TO_DATA_TO_BE_INDEXED PATH_TO_QUERY_DATASETS

This script runs Giggle to index the data in PATH_TO_DATA_TO_BE_INDEXED, then queries the indexed data using the query samples in PATH_TO_QUERY_DATASETS. Note that the script runs each query 10 times.
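Running each query several times gives a set of wall-clock measurements rather than a single, possibly noisy, value. The sketch below illustrates that pattern for an arbitrary command; it is not taken from giggle.py, the helper name time_command is hypothetical, and the command being timed here is a placeholder rather than an actual Giggle invocation.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=10):
    """Run cmd `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so a failed run is not timed silently.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder command: a Python interpreter that exits immediately.
    runs = time_command([sys.executable, "-c", "pass"], repeats=3)
    print(f"mean: {sum(runs) / len(runs):.4f} s over {len(runs)} runs")
```

Keeping the individual durations, instead of only their mean, makes it possible to report the spread across repetitions as well.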

This Python script requires Giggle and bedtools to be installed and configured. After installation, open the script in a text editor and set the GIGGLE_SOURCE variable to the path of your Giggle installation.

The links to the sources of the query samples we used for benchmarking Di4 against Giggle are available in this text file.

The Roadmap Epigenomics dataset, which we used for benchmarking Di4 against Giggle, can be downloaded from this text file.

NOTE: if you are indexing a large number of samples (e.g., more than 100), you need to run ulimit -Sn 16384 before executing the Python script; this raises the soft limit on open file descriptors for the current shell.
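For readers who prefer to set the limit from Python rather than from the shell, the standard resource module offers a comparable mechanism on Unix-like systems. The sketch below is an illustration only: the function name raise_open_file_limit is hypothetical, and nothing here implies that giggle.py does this itself.

```python
import resource


def raise_open_file_limit(target=16384):
    """Raise the soft limit on open file descriptors toward target.

    The soft limit can never exceed the hard limit, so target is capped.
    Returns the soft limit in effect after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        return soft  # the system refused the change; keep the old limit
    return target


if __name__ == "__main__":
    print("soft limit on open files:", raise_open_file_limit())
```

The new limit applies to the current Python process and to child processes started afterwards, so it would need to be called before any external tool is launched.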