PathLocDB - Pathway Localization database with subcellular localization of metabolic pathways and participant enzymes - parsing the flat data

I recently tweeted about PathLocDB, a novel and comprehensive database for localizing biological pathways and the catalytic force that drives them: enzymes. The database was created and curated by Min Zhao and Hong Qu (the open access publication is recommended for creative input on how to bring the data into a wider context).
The entire dataset is neatly available from this link, but it requires some preprocessing if one wants to bring the data into a fast-accessible (i.e. performant) format, depending on the user's requirements. In particular, some fields hold several values at once, separated by additional value delimiters.
An exemplary entry (PLSP262; PL...Plant, SP...SuperPathway, #ID) will look like this. Localizations are provided, as well as the organism associated with the entry, although the organism IDs have to be resolved further. Additional information on the protein, the pathway and the protein family is available within the dataset (which I will put up in my Datahub in SQL format shortly). Since the data directly contains the protein sequence associated with each entry, searching by homology is straightforward as well, which makes the data more amenable to meta-analysis.

The PathLocDB database (as self-described) allows:
 " 1. searching and browsing the metabolic pathways by their subcellular localizations and organisms
  2. systematic comparing the localization profiles of metabolic pathways between different organisms
  3. discover the potential regulatory mechanisms and suspicious localization of metabolic pathways
  4. clarify the pathway boundary from the view of subcellular localization
  5. discover the mechanism of intermediates communication between different subcellular localizations "


Data files
Besides the web interface, all data is additionally provided as flat files (files with usually tab-separated values and either '\r\n' or '\n' as line terminator), in acknowledgement of open-access policies. Such files can easily be imported into an existing MySQL database via the LOAD DATA INFILE command (which should usually be preferred over the command-line wrapper mysqlimport); for PostgreSQL, the COPY command serves the same purpose.
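Before loading anything into a database, it can help to look at the records in a script first. The following is only a minimal sketch, assuming the files are plain tab-separated text without quoting; the helper name read_flat_rows and the sample values are made up for illustration and are not taken from the PathLocDB files.

```python
import csv
import io


def read_flat_rows(handle):
    """Yield each record of a tab-separated flat file as a list of strings."""
    # csv.reader copes with both '\r\n' and '\n' line endings, provided the
    # stream was opened with newline='' so that no translation happens first.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip completely empty lines
            yield row


# Made-up sample that mixes both line terminators (not real PathLocDB data).
sample = io.StringIO(
    "PLSP262\tvalue a\tvalue b\r\nID2\tvalue c\tvalue d\n", newline=""
)
rows = list(read_flat_rows(sample))
```

For a real file, the same function could be fed with open(path, newline="", encoding="utf-8"), where path and the encoding are again assumptions to be checked against the download.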

Missing Data Headers
Unfortunately, the headers of the provided flat files were missing, which was reason enough to write up this quick blog post.
The field headers could be recreated quickly by piecing together the field values shown in the web interface and the information in the publication, albeit incompletely: one field, which holds a three-digit number, remains unexplained.
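Once a header list has been pieced together, attaching it to each row is a small step. The sketch below is purely illustrative: the column names in HEADERS, their order, and the sample values are hypothetical placeholders, and the unexplained field simply gets a neutral name until its meaning is known.

```python
# Hypothetical column names and order; the real ones have to be checked
# against the web interface and the publication.
HEADERS = ["entry_id", "localization", "organism_id", "unknown_3digit"]


def label_row(row, headers=HEADERS):
    """Pair each field value with its reconstructed header name."""
    if len(row) != len(headers):
        # A length mismatch usually means the header list is still incomplete.
        raise ValueError(f"expected {len(headers)} fields, got {len(row)}")
    return dict(zip(headers, row))


# Placeholder values, not a real PathLocDB record.
record = label_row(["PLSP262", "some localization", "some organism id", "123"])
```

Raising an error on a length mismatch, rather than silently padding, makes it obvious when a file has more or fewer columns than the reconstructed header list.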


Notes:

Getting all the fields into the desired end format will likely take at least two runs. Truncate your table between runs.

Preprocessing and Parsing out values
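Since some fields bundle several values behind an extra delimiter, one way to preprocess them is to split each such field and emit one row per value. This is a rough sketch only: the ';' delimiter, the function names, and the sample values are assumptions for illustration, so check the actual delimiter in the files before using it.

```python
def split_multi_value(field, delimiter=";"):
    """Split one multi-valued field into a list of trimmed, non-empty values."""
    return [part.strip() for part in field.split(delimiter) if part.strip()]


def explode(record_id, field, delimiter=";"):
    """Turn one (id, multi-valued field) pair into one (id, value) pair per value."""
    return [(record_id, value) for value in split_multi_value(field, delimiter)]


# Placeholder values; the ';' delimiter is an assumption, not the documented one.
pairs = explode("PLSP262", "value a; value b;value c")
```

The resulting pairs map naturally onto a separate link table keyed by the entry ID, which keeps the main table free of delimiter-packed columns.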


