JavaScript: Notes on javascript's RegExp object (methods compile, exec and test)

Saturday, 27 August 2011

'\s'

"s"

Yet in PHP and many other languages the situation is reversed, because the backslash is escaped automatically:

>> "\s"
'\\s'

>> '\s'
'\\s'

It is important to note that RegExp objects have no comparison operators of their own, so two RegExp objects cannot be compared by content. The comparison does not yield undefined but false, even when pattern and flags are exactly equal.
This is because the generic comparison operators == and === compare objects by identity (reference), and every regular expression literal creates a new object.

/abcd/ == /abcd/

false

/abcd/ === /abcd/

false

/abcd/ !== /abcd/

true
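
A minimal sketch of the identity comparison shown above. The variable names are made up for illustration:

```javascript
// Two literals with the same pattern create two separate objects.
const firstRegex = /abcd/;
const secondRegex = /abcd/;
const aliasOfFirst = firstRegex; // a second name for the same object

const distinctObjectsEqual = firstRegex === secondRegex; // false: different objects
const sameObjectEqual = firstRegex === aliasOfFirst;     // true: one and the same object

console.log(distinctObjectsEqual, sameObjectEqual);
```

Only a reference to the very same object compares as equal.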

Fortunately toString is defined, which allows comparison by string value as well as obtaining the length of the object's textual form.
/abcd/+'' === ''+/abcd/
true
The unary plus operator converts its operand to a number (the same conversion as Number(), not parseInt). The RegExp is first converted to its string form, which is not a valid number:

+/abcd/

NaN
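
The string and number conversions used above can be sketched as follows. This is only an illustration and the variable names are hypothetical:

```javascript
const pattern = /abcd/;

const asText = String(pattern);   // same result as pattern + ''
const textLength = asText.length; // the length counts both slashes
const asNumber = +pattern;        // unary plus applies the Number conversion

console.log(asText, textLength, Number.isNaN(asNumber));
```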

When comparing the strings of two regular expressions, mind the order of the regular expression 'flags', as the following examples show.
Flags are declared like this:
/pattern/flags;
new RegExp("pattern","flags");

1.2 JavaScript 1.5 and later knows three flags ('g', 'i' and 'm'):

g, Global Search: lets the RegExp find every match in the string instead of stopping at the first one. String.match then returns all matches in an array, while exec and test continue from the position stored in lastIndex.
i, Ignore Case: makes the regular expression case insensitive, which may however not work as expected on extended characters (e.g. â, å, ä, æ), depending on the engine and the encoding of the text.
m, Multiline Input: changes the meaning of beginning of text (^) and end of text ($) to beginning of a line and end of a line, respectively.

The order of those flags is irrelevant (search yields the position of the first match):

"32432abcd\naBcd".search(/^abcd$/img)

"32432abcd\naBcd".search(/^abcd$/mig)

"32432abcd\naBcd".search(/^abcd$/gmi)

"32432abcd\naBcd".search(/^abcd$/igm)

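The four calls above can be checked in one go. The following sketch builds the same pattern with each flag order via the RegExp constructor; the variable names are assumptions made for this example:

```javascript
const subject = "32432abcd\naBcd";
const flagOrders = ["img", "mig", "gmi", "igm"];

// Run the same search once per flag order.
const positions = flagOrders.map(
  (flags) => subject.search(new RegExp("^abcd$", flags))
);

console.log(positions); // every entry is 10, the start of the second line
```

The first line does not match because of the leading digits, so the match is found right after the newline.
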
The flags are normalized to the order gim when passed as the second parameter of the RegExp constructor:

RegExp("^abcd$", "img")

/^abcd$/gim

RegExp("^abcd$", "mig")

/^abcd$/gim

RegExp("^abcd$", "igm")

/^abcd$/gim

This makes comparison via the toString method easy. If only the regular expression pattern is to be compared, the source property of the RegExp object can be used:

/^abcd$/gim.source

"^abcd$"

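A possible comparison helper built on these properties might look like this. Note that sameRegExp is a hypothetical name, and the flags property is an assumption about the environment (it exists in ECMAScript 2015 and later, not in the JavaScript 1.5 discussed here):

```javascript
// Hypothetical helper: true if both regular expressions share pattern and flags.
function sameRegExp(a, b) {
  return a.source === b.source && a.flags === b.flags;
}

const fromLiteral = /^abcd$/gim;
const fromConstructor = new RegExp("^abcd$", "mig");

const equalByHelper = sameRegExp(fromLiteral, fromConstructor);       // true
const equalByString = String(fromLiteral) === String(fromConstructor); // true
const differentFlags = sameRegExp(/abcd/i, /abcd/g);                   // false

console.log(equalByHelper, equalByString, differentFlags);
```
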
These flags are also exposed in the RegExp object as read-only boolean properties:
RegExp.global
RegExp.ignoreCase
RegExp.multiline

/^abcd$/gim.global

true

/^abcd$/gim.ignoreCase

true

/^abcd$/gim.multiline

true
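
In current engines these flag properties cannot be assigned. The sketch below tries it inside a strict-mode function; tryToClearGlobal is a hypothetical helper written for this example:

```javascript
// Hypothetical helper: tries to switch off the global flag and reports what happened.
function tryToClearGlobal(regex) {
  "use strict"; // in strict mode a failed assignment throws instead of being ignored
  try {
    regex.global = false;
    return "assigned";
  } catch (error) {
    return error.name;
  }
}

const flagged = /^abcd$/gim;
const outcome = tryToClearGlobal(flagged);

console.log(outcome, flagged.global); // the flag is still set afterwards
```

To change the flags, a new RegExp object has to be created from the source.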

The method String.search(<String> or <RegExp>) seems of little use at first, considering that it only returns the position of the first match, something String.indexOf does as well, minus the RegExp capability.
A RegExp object with the 'g' flag, however, keeps state between calls, which allows stepping through all matches of a text. More on that later.

For a more familiar search, comparable to other languages (e.g. preg_match in PHP), RegExp offers the method exec(), which returns an array containing the matched string followed by the strings of any captured groups, or null if nothing matches.

In the following example null (not undefined!) is returned, since without the flag 'm' the symbols ^ and $ only match at the beginning and the end of the entire string:

/^abcd$/ig.exec("32432abcd\naBCd\nAbcD\nk")

null

With the flag 'm' the second line matches. The array has only one element since the pattern defines no capturing group. Of the two pattern delimiters, (pattern) captures and stores a submatch, while (?:pattern) groups without capturing.

/^abcd$/igm.exec("32432abcd\naBCd\nAbcD\nk")

["aBCd"]

With a non-capturing group the array again contains only the full match:

/(?:abcd)+/g.exec("32432abcd\naBCd\ndAbcDf\nk")

["abcd"]

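In the following examples the array has two elements. These are not two separate matches: the first element is the full match and the second is the text captured by the group (abcd), which is identical here. exec returns only one match per call, even with the flag 'g', so the three matching lines are not returned at once. The original upper and lowercase characters are preserved: the second example yields the lowercase "abcd" because, without the anchors ^ and $, the first match is the "abcd" at the end of the first line. (This is also reproducible with a new RegExp object.)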

/^(abcd)$/igm.exec("32432abcd\naBCd\nAbcD\nk")

["aBCd", "aBCd"]

/(abcd)/igm.exec("32432abcd\naBCd\nAbcD\nk")

["abcd", "abcd"]
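
To collect all three occurrences, exec can be called repeatedly on the same RegExp object with the 'g' flag. The following is a sketch and the variable names are assumptions for this example:

```javascript
const text = "32432abcd\naBCd\nAbcD\nk";
const matcher = /(abcd)/gim;
const found = [];

let result;
// Each call continues after the previous match; null ends the loop.
while ((result = matcher.exec(text)) !== null) {
  found.push({ match: result[0], group: result[1], index: result.index });
}

console.log(found.map((entry) => entry.match)); // ["abcd", "aBCd", "AbcD"]
console.log(found.map((entry) => entry.index)); // [5, 10, 15]
```

Every result array holds the full match at index 0 and the captured group at index 1, with the original case intact.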

RegExp has another method, 'test', which returns true if a match occurred and false otherwise. Its use is analogous to exec, but its added value is limited since the same computations have to be made. To get the same result with exec you can use:

!!/(abcdED)/g.exec("32432abcd\naBCd\ndAbcDf\nk")

false

!!/(abcd)/g.exec("32432abcd\naBCd\ndAbcDf\nk")

true
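
A short side-by-side sketch of test and exec on the same input. The regular expressions are written without the 'g' flag here so that no state is carried between calls; the variable names are hypothetical:

```javascript
const haystack = "32432abcd\naBCd\ndAbcDf\nk";

const viaTest = /(abcd)/.test(haystack);          // true
const viaExec = /(abcd)/.exec(haystack) !== null; // true, same answer
const noMatch = /(abcdED)/.test(haystack);        // false

console.log(viaTest, viaExec, noMatch);
```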

Note: JavaScript evaluates !null to true, which is then negated again, yielding false: !(!null). A match array, in contrast, is truthy, so the double negation yields true. Be aware that such programming abbreviations make the code less comprehensible and are not recommended. You could also store the result and check it in one statement, which is again not recommended:

var x; if ((x = /(abcd)/g.exec("32432abcd\naBCd\ndAbcDf\nk")) !== null) alert("true")

An interesting method in JavaScript 1.5 and later is compile, which reinitializes an existing RegExp object with a new pattern and new flags. Flags that are not passed again are cleared. Note that compile has since been deprecated. In principle you could use it within a dynamic parser which improves its regex pattern during the course of the parsing.
Since a regular expression is compiled into an internal structure before it is used, reusing the object might slightly decrease the computational cost, although any such gain depends on the engine.
Compile may be of use in a loop with thousands of iterations and changing regular expression patterns, since it allows reuse of the RegExp object instead of creating a new one on each pass. (Otherwise the discarded RegExp objects are left to the garbage collector; the delete operator only removes object properties and does not free memory directly.)

regex.compile(pattern, flags)

Compile resets all properties:

var regex = /(abcd)/gi

undefined

regex.lastIndex

regex.lastIndex = 11

regex.compile("abcd")

undefined

regex.source

"abcd"

regex.ignoreCase

false

regex.lastIndex
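
The console session above can be condensed into one runnable sketch. It assumes an engine that still ships the deprecated compile method (V8 and therefore Node.js do), and the variable names are made up for this example:

```javascript
const reusable = /(abcd)/gi;
reusable.lastIndex = 11;

// compile is deprecated, so check for it before calling it.
if (typeof reusable.compile === "function") {
  reusable.compile("abcd"); // new pattern, no flags passed
}

const stateAfterCompile = {
  source: reusable.source,
  global: reusable.global,
  ignoreCase: reusable.ignoreCase,
  lastIndex: reusable.lastIndex,
};

console.log(stateAfterCompile);
```

In such an engine the flags are cleared and lastIndex is back at 0, which matches the statement that compile resets all properties.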

The RegExp object is actually indirectly iterable by means of its lastIndex pointer, which, when the flag 'g' is set, stores the position right after the last match within the text. This can lead to interesting results (caveat: in some browsers):

The reason is that the RegExp Object has a read/write (integer) pointer property called lastIndex specifying the index at which to start the next match! By resetting the pointer the example provided above behaves as expected:
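
One way to reproduce the effect of lastIndex is to reuse a RegExp object with the 'g' flag on different strings. This is only a sketch with hypothetical variable names:

```javascript
const stateful = /abcd/g;

const firstResult = stateful.test("xxabcd"); // true
const indexAfterFirst = stateful.lastIndex;  // 6, right after the match

// The next search starts at index 6, beyond the end of the shorter string.
const secondResult = stateful.test("abcd");  // false, and lastIndex falls back to 0

stateful.test("xxabcd");                     // matches again, lastIndex is 6 once more
stateful.lastIndex = 0;                      // reset the pointer by hand
const afterReset = stateful.test("abcd");    // true

console.log(firstResult, indexAfterFirst, secondResult, afterReset);
```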

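Mozilla's Gecko engine has an additional flag called sticky ('y'), which was later standardized in ECMAScript 2015. It makes a match succeed only if it starts exactly at lastIndex. You can read up on Regular Expressions on MDN.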
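
A brief sketch of the sticky flag, assuming an engine that supports 'y' (ECMAScript 2015 or later). The variable names are hypothetical:

```javascript
const stickyRegex = /abcd/y;
const input = "32432abcd";

stickyRegex.lastIndex = 0;
const atStart = stickyRegex.test(input);  // false: there is no "abcd" at index 0

stickyRegex.lastIndex = 5;
const atFive = stickyRegex.test(input);   // true: "abcd" starts exactly at index 5
const indexAfter = stickyRegex.lastIndex; // 9, right after the match

console.log(atStart, atFive, indexAfter);
```

Unlike a search with 'g', a sticky match does not scan forward from lastIndex.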
---Regular Expression patterns:
Escaping:
\
Escapes special characters to literal and literal characters to special.
{n}, {n,}, {n,m}, *, +, ?

From the POSIX Regex documentation et al.:

Brackets are used to find a range of characters:
[abc] Find any character between the brackets
[^abc] Find any character not between the brackets
[0-9] Find any digit from 0 to 9
[A-Z] Find any character from uppercase A to uppercase Z
[a-z] Find any character from lowercase a to lowercase z
[A-z] Find any character from uppercase A to lowercase z (this range also includes the characters [ \ ] ^ _ ` which lie between Z and a)
[adgk] Find any character in the given set
[^adgk] Find any character outside the given set
(light|dark|gray) Find any of the alternatives specified
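
A few of the bracket expressions above in action. This is a sketch and the sample strings are made up:

```javascript
const digits = "a1b2c3".match(/[0-9]/g);     // ["1", "2", "3"]
const nonDigits = "a1b2c3".match(/[^0-9]/g); // ["a", "b", "c"]

// [A-z] also covers the characters between Z and a, for example the underscore.
const strictLetters = /^[A-Za-z]+$/.test("snake_case"); // false
const looseRange = /^[A-z]+$/.test("snake_case");       // true

const alternative = /(light|dark|gray)/.exec("a dark room")[1]; // "dark"

console.log(digits, nonDigits, strictLetters, looseRange, alternative);
```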

Quantifiers match the preceding subpattern a certain number of times.
n+  Matches any string that contains at least one n
n*  Matches any string that contains zero or more occurrences of n
n?  Matches any string that contains zero or one occurrences of n
n{X} Matches any string that contains a sequence of X n's
n{X,Y} Matches any string that contains a sequence of X to Y n's
n{X,} Matches any string that contains a sequence of at least X n's
n$  Matches any string with n at the end of it
^n  Matches any string with n at the beginning of it
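
The quantifiers can be compared on a handful of sample strings. In this sketch each pattern is anchored with ^ and $ so that the whole string has to fit; matching is a hypothetical helper:

```javascript
const samples = ["", "n", "nn", "nnn", "nnnn"];

// Hypothetical helper: returns the samples that a regular expression accepts.
function matching(regex) {
  return samples.filter((sample) => regex.test(sample));
}

const oneOrMore = matching(/^n+$/);      // ["n", "nn", "nnn", "nnnn"]
const zeroOrMore = matching(/^n*$/);     // all five samples, including ""
const zeroOrOne = matching(/^n?$/);      // ["", "n"]
const exactlyTwo = matching(/^n{2}$/);   // ["nn"]
const twoToThree = matching(/^n{2,3}$/); // ["nn", "nnn"]
const atLeastThree = matching(/^n{3,}$/); // ["nnn", "nnnn"]

console.log(oneOrMore, zeroOrMore, zeroOrOne, exactlyTwo, twoToThree, atLeastThree);
```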

Lookaheads
(?=n)  Matches a position that is followed by a specific string n, without making n part of the match
(?!n)  Matches a position that is not followed by a specific string n
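
A small sketch of both lookaheads, using the letter q with and without a following u. The sample words are made up:

```javascript
const followedByU = /q(?=u)/.exec("quit");   // matches the "q" in "quit"
const matchedText = followedByU[0];          // "q": the "u" is checked but not consumed

const notFollowedByU = /q(?!u)/.test("Iraq"); // true: the "q" is not followed by "u"
const rejected = /q(?!u)/.test("quit");       // false: the only "q" is followed by "u"

console.log(matchedText, notFollowedByU, rejected);
```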

Pattern delimiters
(pattern) captures match
(?:pattern) doesn't capture match

Metacharacters are characters with a special meaning:
. Find a single character, except newline or line terminator
\w Find a word character
\W Find a non-word character
\d Find a digit
\D Find a non-digit character
\s Find a whitespace character
\S Find a non-whitespace character
\b Find a match at the beginning/end of a word
\B Find a match not at the beginning/end of a word
\0 Find a NUL character
\n Find a new line character
\f Find a form feed character
\r Find a carriage return character
\t Find a tab character
\v Find a vertical tab character
\xxx Find the character specified by an octal number xxx
\xdd Find the character specified by a hexadecimal number dd
\uxxxx Find the Unicode character specified by a hexadecimal number xxxx
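
Some of these metacharacters applied to one sample line. This is a sketch and the sample text is invented:

```javascript
const line = "Order 66\tshipped on 2011-08-27";

const numbers = line.match(/\d+/g);     // ["66", "2011", "08", "27"]
const words = line.match(/\b\w+\b/g);   // seven word-like chunks
const fields = line.split(/\s+/);       // split on spaces and the tab

const hexEscape = /\x41/.test("A");       // true: \x41 is "A"
const unicodeEscape = /\u0041/.test("A"); // true: \u0041 is "A" as well

console.log(numbers, words, fields, hexEscape, unicodeEscape);
```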

Backreferences
\n (where n is a number) references the LITERAL text matched by the n'th capturing group, not the pattern of that group.
e.g: /<(\S+).*>(.*)<\/\1>/ matches 'sample '
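
The backreference pattern above can be tried on a tagged string. The input below is a hypothetical example chosen for this sketch:

```javascript
const tagPattern = /<(\S+).*>(.*)<\/\1>/;

const matched = tagPattern.exec("<b>sample </b>");
const tagName = matched[1];   // "b": the text captured by the first group
const innerText = matched[2]; // "sample "

// The closing tag must repeat the captured text literally.
const mismatched = tagPattern.exec("<b>sample </i>"); // null

const doubledLetter = /(\w)\1/.test("hello"); // true: "ll" repeats the captured "l"

console.log(tagName, innerText, mismatched, doubledLetter);
```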

1.3 Replace ...todo

Asma Nadia, Saturday, 27 August 2011. Tags: analysis, Javascript, RegExp