Notepad++ Regular Expression Replace to make a flatfile database from a TAIR FASTA protein sequence file

Sabtu, 17 Desember 2011

On Windows I appreciate the free Notepad++ text editor with dozens of free plugins which can be selected and downloaded and installed directly within the Editor's user interface. Albeit slow in execution at times, compared to other editors (e.g. Notepad2), I never got around to truly engage myself in the RegEx features this editor-class provides, because it never seemed to work right (- usually a case for RTFM).

Apparently the set of Regular Expressions usable is quite limited, more to that in a bit.

In the following case the TAIR sequence data of the model plant Arabidopsis thaliana is prepared to become a flatfile, ready for straightforward import in MySQL.

LOAD DATA LOCAL INFILE 'DBs\\AraCyc_Enzymes_Without_Tags_BLASTset.fasta'
INTO TABLE arath_tair_enzymesequence
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';

(Note: The assignment of the table names is: species_datasource_tablename; and I stick to lowercase. There are arguably a myriad of better naming-schemes though)

The fasta-file content looks like this:

>Q93V56 | AraCyc: AT1G01050-MONOMER | inorganic pyrophosphatase | Species: Arabidopsis thaliana | Gene: PPa1
MSEETKDNQRLQRPAPRLNERILSSLSRRSVAAHPWHDLEIGPGAPQIFNVVVEITKGSKVKYELDKKTG
LIKVDRILYSSVVYPHNYGFVPRTLCEDNDPIDVLVIMQEPVLPGCFLRARAIGLMPMIDQGEKDDKIIA
VCVDDPEYKHYTDIKELPPHRLSEIRRFFEDYKKNENKEVAVNDFLPSESAVEAIQYSMDLYAEYILHTL
RR

A dozen string-operations:

I successively replace the following text sequences, with only one RegEx required:

RegEx: '^>(.*)' -> '\1\t'
' | AraCyc: ' -> '\t'
' | Species: ' -> '\t'
' | Gene: -> ' '\t'
' | ' -> '\t'
'^>' with(->) ''

This still leaves getting rid of the whitespaces (\n) in the sequence data and adding a tab-delimiter before it:
Q93V56 AT1G01050-MONOMER inorganic pyrophosphatase Arabidopsis thaliana PPa1
MSEETKDNQRLQRPAPRLNERILSSLSRRSVAAHPWHDLEIGPGAPQIFNVVVEITKGSKVKYELDKKTG
LIKVDRILYSSVVYPHNYGFVPRTLCEDNDPIDVLVIMQEPVLPGCFLRARAIGLMPMIDQGEKDDKIIA
VCVDDPEYKHYTDIKELPPHRLSEIRRFFEDYKKNENKEVAVNDFLPSESAVEAIQYSMDLYAEYILHTL
RR

Searching for ^([A-Z]{30,} won't work in this editor-class. The Pattern Quantifiers available are limited to +,* and ?. You can see examples of the fairly limited RegEx repertoire for NodePad-plus-plus here. Additionally the RegEx-mode is limited to single-line mode.

That leaves me with using ^([A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z]+)

Fig 1. Np++-Editor automatically highlighting the matching characters

The pattern matches (...)...(...) can then be back-referenced from \1...\n in the replace-input field.
Since the main problem is that the RegEx-search proceeds line-wise, another solution has to be applied.

The solution is to use the additional new line (\r\n or \n) separation of FASTA entries and replace them temporarily through a unique separator e.g. '\n\n' -> '\n||' . (Press CTRL+H, select extended-mode in Replace and enter \n\n -> \n||).
Now get rid of all new-lines '\n' -> '', and finally establish a new-line per entry in your data by replacing-back: '||' -> '\n'.

Conclusion:
Whether the effort is faster than a pre-made python script, and quickly replacing the Regular-expression is arguable. But if you are the visual type, this way is definitely more fun, rather than watching your temporary CPU-load spike for a few seconds. One advantage is that you can instantly undo any mistake made. The disadvantage is of course, you get to see all the mistakes made :)

PS: I generally prefer Notepad2 for Search and Replace operations, which is blazing fast and lean in resource-usage compared to Notepad++.
I wouldn't recommend doing this for files over 100MB. But then again, in this case fun is out of the question anyways.

Piecing together the information from the flatfile and TAIR, I created the following SQL table:

Be aware that the TAIR Accession-ID per se includes the chromosome-location: e.g.the 3 in 'AT3....' refers to Arabidopsis Thaliana's Chromosome 3, AT3G50740 Visit the TAIR nomenclature for more information.

Note:
Notepad2 is known to have memory-leaks if you 'push it too hard'.
As for the table creation, I originally had a VARCHAR(255) field for the description, yet some 12+ data-truncation messages during MySQL import convinced me to change the table: ALTER TABLE `arath_tair_enzymesequence` CHANGE `description` `description` MEDIUMTEXT; - Changing the table is often necessary during data import from flatfiles. In my opinion it is best to get the Tables/Databases right before the real coding starts.
As usual, all files have been uploaded in the Datahub (databases/sequence)

Others:
Wikipdia - List of Text Editors

KETIKAN

Notepad++ Regular Expression Replace to make a flatfile database from a TAIR FASTA protein sequence file

Asma Nadia Sabtu, 17 Desember 2011 edit Tags: Database, RegExp, SQL