Robust import / export of comma separated values or spreadsheets into a web-application

Sabtu, 26 Mei 2012

Described herein is a freely available library for importing and exporting any kind of separated values. As such flatfile databases (e.g. FASTA format), do qualify. Only requirement is providing a text-resource, which can be local or remote and may even be binary to some extend. In the case of an external resource, the access mode can be optionally specified, that is whether transmission should occur synchronous or asynchronous (default).
Consider this post a follow up to my previous 'musings' on csv-parsing.

Introduction
As stated once in my previous post, the one constant one may count on in the comma separated value 'business', are values which are separated. Frequently separators may indeed be commas. However,using csv as a file extension for arbitrarily separated values, as opposed to individual file-extension, allows quick recognition of spreadsheets, assigned to a typical file system icon in addition to file previews in modern file-system explorers.
Undoubtedly, csv provides it sets of perks, which is the reason for its frequent and wide adoption, yet sometimes a user-base entails scandalous file output by some programs/scripts. Whilst programs and scripts are easily fixed, large existing datasets, derived from these tools, on the other hand are not. Notably one of csv perks is its ubiquitousness. Practically this means that there are virtually no scientists who have not ever come into contact with this file format.

Importing crudely or badly formatted csv, separated values

Based on my previous post, a robust, highly configurable implementation is presented along with a demo, which incidentially is a small facet of an upcoming project. The JavaScript library, which may be run via Node.js in a server-like manner, supports importing and exporting of arbitrarily separated values, from an ASCII text file or HTML input element, even if the values are badly and inconsistently formatted. The separator is automatically detected.

CSV can be mis-formatted irreparably, without additional metainformation. For instance, 1.2.3.4.4.523.32.4.324.432.434 , wherein the dot poses as a separator and decimal mark at the same time. Wikipedia provides a list of 'Countries using Arabic numerals with decimal point' - particularly useful to quickly narrow down possible countries of origins of a person. A way around this, is by generally separating values with canonical smilies ;) - no, that's the wrong one.

Following are examples of badly formatted csv's

Examples of badly formatted csv's

Library features and support

Variable comment characters
Multiline/Blockcomments, heredoc and partial comments spanning one or several lines e.g. content /*..\n...*/ content2
Automatic variable separator detection
Definition of commonly encountered separators
type interference and reduction to primitive types, e.g. Dates, Floats, Integers, Objects, Strings etc..
variable nullable type and missing value imputation via a callback function
relatively fast (improvable), avoids speed penalizing RegEx constructs such as lookaheads
crudely formatted csv, e.g. too many separators on a given line, or inconsistent quoting
separators may occur within quoted text fields
mixed quotes
tolerance value for the allowed number of fields extending over the detected number of fields
cutoff number to curb fields going over the detected number of fields (_fieldcutoff = true)

Taken all features into considerations, this makes csv-lib (interim name) one of the most robust separated values import libraries available for web-applications

Possible features:

Dynamic changing of the separator within the scope of one process could easily be implemented, at the cost of a few extra CPU cycles
Multithreaded processing through the use of web-workers (will be implemented in the first release)

Implementation Notes

Lines or individual characters can be commented-out in typical C-syntax notation. Multiline comments are denoted with
/*...
...
...
...*/
Blockcomments within the same line are regarded through regular expression- based deletion of the comment. Wikipedia provides an overview of common commenting notations.

Content is processed linewise by splitting at possible newline feed and carriage return characters or a combination of the two. In principal a non-quoted field may contain any content except for a separator character. Allowed content includes binary content, meaning content containing characters beyond or below the range of printable ASCII symbols. The csv-assigned separator may be dynamic over the course of importing different csv's, and is automatically detected. At least four lines must be provided for detection to succeed.
Additionally speed matters in data-import and export. Thus the use of simple regular expressions applicable to the majority of cases and complex ones for minor exceptions is advisable. Character masking is a strategy often used to circumvent the necessity of complex regular expressions incurring lookaheads and lookbacks, and/or to conditionally apply such complex regular expressions with smaller chunks of data. Notably, regular expressions ultimately compile into conditional clauses to be performed within a regular expression engine or parser., but are often difficult to implement in an optimal manner.
Masking works by conditionally replacing characters, which impede simple solutions due to ambiguity, through a character sequence which will not occur in a given text. After applying simple processing routines, or further breaking down the data, the masked sequences are returned to their original form which incurs minimal processing cost.

Use csv-lib freely (MIT license) and design aspects of this project under the CC-BY-SA v3 licence, with an environmentally-friendly bug-spray handy.

Demo

Fig 1. A badly formatted csv table is imported from a textfield. Empty fields are queried from different web-resources.
Dynamically inserted values are shown in green. This type of user-interface aspect allows for on-the-fly editing, without impediment by UI-based table-configurations and table-editing. csv is the most concise text-based table format with which virtually all scientists are familiar with.

You can check out the Compound-csv example below and type in some compound names on your own (Coming soon! -> I still need to extract functionality and make it into a standalone demo). There are dozens of auto-completion supported columns for chemical compounds. Basically column-names decide the kind of information that will be retrieved for a given compound-related column e.g. logP, logD,InChIKey, InChI, Smiles, Weight or MW, Formula, TPSA (Topological polar surface area), charge, volume, exactmass, monoisotopicmass, complexity,...). Casing does not matter, and natural ambiguity in column-naming is provided by assigning multiple names e.g. MW or molweight to the same ontology e.g. Molecularweight
Information retrival in the standalone blog-demo is limited to Pubchem using its new RESTful interface for the Pubchem PUG API. In actuality the columns are linked to data-providers, providing you with key information such as solubility for a given bulk solvent, as frequently required in separation sciences, especially when dealing with such heterogeneous groups of compounds as metabolites.
Herein the true power of interactive csv-table lies. To auto-fill a value just set it null (i.e. delete it). To undo your action and restore a self-provided value, hit CTRL+Z.
From a user-accessibility perspective, the system is amenable to powerusers as well as casual users with equal ease and editing speed. Lastly the demo makes use from the loosely coupled, lightweight javascript library is-lib which will see its first release soon - major restructuring pending.

That's all. Have a great weekend ahead.

KETIKAN

Robust import / export of comma separated values or spreadsheets into a web-application

Asma Nadia Sabtu, 26 Mei 2012 edit Tags: analysis, Database, design, DOM, flatfile, Javascript, multitasking, project, RegExp