Skip to main content

Parser Configuration

MSPC uses GeUtilities to parse BED files. Out of box, MSPC uses the default configuration of GeUtilities; accordingly, it expects an input BED file to be in format similar to the following example:

chr1    9999    10039   MACS_peak_1 2.42
chr1 10101 10190 MACS_peak_2 3.23
chr1 29303 29382 MACS_peak_3 2.44
chr1 32600 32680 MACS_peak_4 4.08
chr1 32726 32936 MACS_peak_5 17.5
chr1 34689 34797 MACS_peak_6 5.82
chr1 35083 35124 MACS_peak_7 4.59

The fifth column represents the p-value, and by default, MSPC expects it be in -Log10 format (read how to adjust this configuration). Also, by default, MSPC expects each peak to have a p-value; otherwise, that peak will not be parsed into MSPC.

A BED file may have additional columns; however, the content of those columns will not be parsed into MSPC.

The GeUtilities parser is highly customizable that allows parsing BED files represented differently. The following JSON object contains all the configuration attributes, each discussed in details in the following sections.

{
"Chr":0,
"Left":1,
"Right":2,
"Name":3,
"Value":4,
"Strand":-1,
"Summit":-1,
"Culture":"en-US",
"PValueFormat":1,
"DefaultValue":0.0001,
"DropPeakIfInvalidValue":true
}

Column Order

A BED file is a plain-text and tab-delimited file, it has multiple columns, and the type of data in each column follows widely adopted standards with a number of variations. In order to correctly parse different formats of BED files (standard or non-standard), without requiring the users to convert them to a common format, MSPC allows users to configure its parser by specifying the number of columns that contain required information.

For instance, if your samples are stranded, you may configure the parser as the following.

chr1    633859  634162  peak_1  137 .
chr1 1079427 1079669 peak_2 67 .
chr1 1109848 1110187 peak_3 91 .

Create a JSON file as the following:

{"Strand": 5}

and execute MSPC as the following passing the parser configuration :

mspc.exe -i inputs*.bed -r bio -w 1e-4 -s 1e-8 -p parser_config.json

where parser_config.json is the filename of the file containing the JSON object you created above.

p-Value Format

By default, MSPC expects p-values in a BED file to represented in -log10(p-value) format. However, some tools produce p-values in -10log10(p-value), -100log10(p-value), or actual p-value (without log10 scale). To set MSPC to correctly parse peaks according to their p-values representation, use the "PValueFormat" attribute in the parser configuration JSON object according to the following table:

FormatJSON attributeExample p-value from BED fileParsed p-value
Same as input"PValueFormat":00.0010.001
-log10(p-value)"PValueFormat":130.001
-10log10(p-value)"PValueFormat":2300.001
-100log10(p-value)"PValueFormat":33000.001

(See p-value formats of GeUtilities.)

Peaks with Missing or Invalid p-Values

Some BED files may not a valid p-value for some or none of the peaks in that file; for instance, BED files generated per cell in single-cell assays do not commonly provide a p-value per peak. Hence, following cases are possible:

  • missing p-value for some peaks:

    chr1    9999    10039   MACS_peak_1 2.42
    chr1 10101 10190 MACS_peak_2 3.23
    chr1 29303 29382 MACS_peak_3
  • none of the peaks have a p-value (commonly in single-cell assays):

    ```
    chr1 9999 10039
    chr1 10101 10190
    chr1 29303 29382
    ```

By default, MSPC drops all the peaks that do not have a valid p-value. To set MSPC to read such peaks, use the following attribute in parser configuration JSON object:

"DropPeakIfInvalidValue":false

With this configuration, MSPC sets the p-value of peaks with invalid/missing p-value to 1E-8 (see this initialization). To change the default p-value, use the following attribute in parser configuration JSON object:

"DefaultValue":0.0001

Hence, to read the previous example, set the parser configuration JSON object as the following:

{
"Chr":0,
"Left":1,
"Right":2,
"DefaultValue":0.0001,
"DropPeakIfInvalidValue":true
}

Culture info

Numbers can be formatted differently depending on the operating system's culture (or locale) settings. For instance, different cultures use dot (.), comma (,), or forward slash (/) characters as decimal separators. Accordingly, the following numbers formatted following different cultures are equal:

- 1.234
- 1,234
- 1/234

MSPC parses numbers according to the culture setting of the execution environment. For instance, on a operating system with its region set to US, MSPC considers dot (.) as a decimal separator.

In some scenarios input data format may not adhere with the culture setting of the operating system. For instance, a scientist in Europe (where a comma character (,) is used as decimal separator) analyzing data generated by a collaborator from US (where a dot character('.') is used as decimal separator). For such scenarios you may explicitly specify the culture info of the input data using the Culture attribute in parser configuration json file. For instance, the following configuration sets MSPC to read numbers formatted in US style independent from culture setting of the operating system:

{
"Chr":0,
"Left":1,
"Right":2,
"Name":3,
"Value":4,
"Culture":"en-US"
}

You may refer to the Language tag column in the list of language and region names supported by Windows available in this page.

info

The numbers written to the output files generated by MSPC are formatted according to the operating system's settings, independent from the culture setting provided for the input parser.