KETIKAN

Jumat, 18 Mei 2012

Introduction

I didn't expect to churn out another post so soon, neither did I anticipate the act of posting being the result of the complexity of regex parsing for something so simple as CSV, leading to the fallacy of writing simple regular expressions in well under three minutes and calling it a day (or night - depending on how your clock-genes tick).

Indeed few would argue the simplicity of character separated text value formats. Their prevalence, persisting throughout the decades, makes this file-format support crucial.

As a 2D data structure csv is extremely efficiency and offers at least four data-types: quoted strings, NULL values (empty fields), integers and floats and arguably references or constants. Multidimensional structures can be created if the file-system structures are used to extend further dimensions - not uncommon in natural science data and scientific data provision (for instance time-series datasets).

JS RegEx - A road to nowhere

Here is an example of what did not work in JavaScript (current V8 engine), including dozens of wild variations "4-stilbenol;aspirin; 1034;3434;3434;3434;34 ;34;".match(/([a-zA-z0-9#+\s _-]+[;,\t]?){3,}([a-zA-z0-9#+\s _-]+)?/g) - the variations extend to all kind of non-matching group (?:...) trials
The following expression actually works somewhat:
"4-stilbenol;aspirin; 1034;3434;3434;3434;34 ;34;".match(/(?:[a-zA-Z0-9#+\s_-]+;)(?:[a-zA-Z0-9#+\s_-]+$)?/g)
yielding
["4-stilbenol;", "aspirin;", " 1034;", "3434;", "3434;", "3434;", "34 ;", "34;"]

A google search will bring up results (stackoverflow, kimgentes-blog) with solutions non-applicable for roughly formatted csv. Even just for knowingly indulging a purely hypothetical case or rather outlook, which would break down due to speed issues as soon as file greater than >10MB are imported.

For comma separated values, the one constant that can be counted on, are commas that separate values.

Here is another wild RegEx example that will not help your case:

^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$

Algorithm: Fast import and automatic separator detection

Despite a claim towards the contrary, csv constitutes almost a prime example for string-splitting over regex .

1. Split the data at \r\n, \n, \r or use a function for linewise reading

-. common separators are ':', ',', ';', ':', '||', '\t', ' ' i.e. a sequence of three to four spaces [ ]{3,4}

-. // Testing phase

2. For each of the separators, walk through the first four lines, and count the splits split('sep').length or the string's frequency strcnt('sep')

3. Add them up as follows: if (1st_line sepfreq -2nd+3rd -4th...) < tolerance) then separator = found; break; tolerance may be +/-1 on a generous day :)
-. // Load phase
4.* If text-values contain the separator, simply restore simple unquoted values to full quotes and split with a quote concatenated to the separator. This is computationally less demanding than 'RegEx'ing around the issue of whether a separator lies within or outside a text value-field

4. For all lines, split them with the found-separator {

5. For all splits, trim them of white-spaces (e.g. replace(/^(["'])|(["'])$/g,'')), single and double quotes

..and assign the splitted values to the lines in a 2D Array. Herein prior type-casting and missing value imputation may be performed.
}

Notably, the procedure takes > 1sec on a 6 year old PC in a rough C++ implementation for a 10MB csv file.

Conclusion:

The outcome is a 2d-matrix ready for any further computation. I should note that this algorithm is included in an upcoming comprehensive pre-parsing detection library, soon to be released for bioinformatic web-projects.

JavaScript Implementation
Result:

CSV Examples:

KETIKAN

Regular expressions for csv, comma separated values parsing / import? - Don't! Go the 'separate' way in JavaScript

Asma Nadia Jumat, 18 Mei 2012 edit Tags: dynamic, Javascript, RegExp, Regular expressions