Anyway, figuring my refresher might refresh or bring useful news to the attention of others, too, I thought I'd share some aha! moments of "Ooh, that's useful!" or "Ouch, that's dangerous!" (The latter probably explaining some rather weird, very difficult to find and work around bugs I've met in my day.)
First things first. The basics: the RegExp dot (/./) doesn't match newline characters (generalized in the Unicode sense). This wasn't much news to me, though it's something I have once in a while kludged around, where speed is not of the essence, by first replacing most culprits in the match string with a normal space prior to applying the regexp, i.e. (and this too most likely misses a few cases in above-latin-1-land)
mystring = mystring.replace( /[\r\n]+/g, ' ' );
-- so I can then write match patterns where . matches any character. (Note the g flag; without it, only the first run of line breaks gets replaced.)
A much better solution is of course to craft a character class which does match any character. My first thought, [^], turned out to play well with Mozilla 1.5 but pretty much nowhere else (it probably breaks the ecmascript standard, though I didn't bother taking the time to verify), and my second try, [\s\S] (any whitespace or non-whitespace character), works nicely in Mozilla, Opera (9) and Internet Explorer (6) alike.
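To make the difference concrete, here is a minimal sketch comparing the dot with the [\s\S] class on a two-line string. The variable names and the sample text are made up for illustration.

```javascript
// Hypothetical sample: two lines separated by a newline.
const text = "first line\nsecond line";

// "." refuses to cross the newline, so this pattern cannot span both lines.
const dotMatch = /first.*second/.test(text);

// [\s\S] matches any character, newlines included, so this one can.
const anyMatch = /first[\s\S]*second/.test(text);

console.log(dotMatch, anyMatch); // false true
```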
Next up, a reminder of what the "multiline" RegExp flag (/^foo$/m) does: it widens the semantics of ^ and $ to match not just start and end of string, but start and end of line too. Only that. Nothing else. (Really!)
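A small sketch of what the m flag changes and what it leaves alone; the three-line sample string is an assumption picked for the demonstration.

```javascript
// Hypothetical sample: "foo" appears alone on the first and last line.
const lines = "foo\nbar\nfoo";

// Without m, ^ and $ only anchor at the very start and end of the string.
const withoutM = lines.match(/^foo$/);

// With m, ^ and $ also anchor at the start and end of each line.
const withM = lines.match(/^foo$/m);
const everyLine = lines.match(/^foo$/gm);

// The flag does nothing for the dot: it still refuses to match a newline.
const dotAcrossLines = /foo.bar/m.test(lines);

console.log(withoutM, withM && withM.index, everyLine, dotAcrossLines);
// null 0 [ 'foo', 'foo' ] false
```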
Finally, the RegExp flag "global" (/lotsaplaces/g). Here be dragons! But first, a useful feature I wasn't aware of: String.prototype.match, when fed a global-match RegExp, returns an Array with all match occurrences in the string (the full match; no match groups). I have usually been messing with loops around RegExp.prototype.exec to get hold of these before, often for very little reason, figuring mystring.match( regexp ) and regexp.exec( mystring ) were doing exactly the same thing. In the case of global RegExps, they don't; exec always gives you all the match groups, one match per call.
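Side by side, the two approaches look roughly like this. The input string, the pattern and the variable names are all invented for the example.

```javascript
// Hypothetical input and pattern: single-character keys and values.
const input = "a=1, b=2, c=3";
const pairPattern = /(\w)=(\d)/g;

// match() with a global RegExp: every full match, but no capture groups.
const fullMatches = input.match(pairPattern);

// exec() in a loop: one match per call, capture groups included.
const groups = [];
pairPattern.lastIndex = 0; // start from a known state
let m;
while ((m = pairPattern.exec(input)) !== null) {
  groups.push([m[1], m[2]]);
}

console.log(fullMatches); // [ 'a=1', 'b=2', 'c=3' ]
console.log(groups);      // [ [ 'a', '1' ], [ 'b', '2' ], [ 'c', '3' ] ]
```

If all you need is the list of matched substrings, match() saves you the loop; the loop only earns its keep when you want the groups.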
And here comes the pitfall. In order to be able to match later occurrences on the same string when called repeatedly, a global RegExp object's lastIndex property gets set to the character position right after the last found match (or to zero, when no match was found). The next search using that very same RegExp object, which remembers how far into the string it should start looking, will start looking there, on whatever string is fed to it. Unless you start out by zeroing lastIndex before performing your
This probably rarely bites people who write wasteful code that never reuses previously instantiated objects. If you do choose to keep a regexp object around, though, it's easy to miss that it carries stateful baggage from prior uses, and needs to be handled with care. I'll be sure to guard my methods that take RegExp parameters better against this kind of very hard to find bug, which might remain undetected too, for the sheer obscurity of it. Debug printouts (RegExp.prototype.toSource()) of the RegExp object don't mention the state of the lastIndex property, and if your test() for occurrences of the character "h" in "hi ho, hi ho, it's off to work we go" returns false once every five calls, it's likely to go unnoticed, or, at best, yield a bug report stating that "sometimes, something doesn't work here".
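Here is that very scenario as a runnable sketch. The variable names are mine; the string is the one from the paragraph above, and it holds four "h" characters, hence the miss on every fifth call.

```javascript
const song = "hi ho, hi ho, it's off to work we go";
const hPattern = /h/g; // one global RegExp object, reused for every call

const results = [];
for (let i = 0; i < 10; i++) {
  results.push(hPattern.test(song));
}
console.log(results.join(" "));
// true true true true false true true true true false

// The stale position carries over to other strings as well.
const stale = /h/g;
stale.test(song);                  // finds the "h" at index 0, lastIndex is now 1
const missed = stale.test("ha");   // starts looking at index 1 and finds nothing
console.log(missed);               // false
```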
The cases where I believe I have run into bugs due to this easy-to-miss fringe case involve really freaky huge regular expressions (kept cached) to parse out data, where the parsing code could once in a while choose to terminate early (before having iterated through the full match set, at which point lastIndex would be reset to zero automatically), leaving a non-pristine RegExp with state baggage from the last run, which then misses early matches on the next run, on some other data set.
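A much smaller stand-in for that situation might look like the sketch below. The cached pattern, the function name firstNumberOver and the data strings are all hypothetical; only the early return matters.

```javascript
// Hypothetical cached pattern, shared by every call to the function below.
const cachedPattern = /\d+/g;

// Returns the first number in `data` greater than `limit`, or null.
function firstNumberOver(limit, data) {
  let m;
  while ((m = cachedPattern.exec(data)) !== null) {
    const n = Number(m[0]);
    if (n > limit) {
      return n; // early exit: lastIndex is left pointing into `data`
    }
  }
  return null; // the loop ran dry, so lastIndex is back at zero
}

const first = firstNumberOver(10, "5 20 7 30");  // 20
const leftover = cachedPattern.lastIndex;        // 4, baggage from the early exit
const second = firstNumberOver(10, "99 1 2 3");  // null: the search began at index 4
const third = firstNumberOver(10, "99 1 2 3");   // 99: the failed run reset lastIndex

console.log(first, leftover, second, third); // 20 4 null 99
```

Same function, same data, different answers on consecutive calls: exactly the kind of thing that ends up in a "sometimes it doesn't work" bug report.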
So be sure to respect the global flag, and whenever you use the exec() or test() methods on a RegExp object Not Instantiated Here (evil twin of the Not Invented Here rule), either start off zeroing its lastIndex property, or better still (so you don't wreck state for your caller) start off making a copy of the passed RegExp for your own use -- new RegExp( passed_regexp ) (and don't forget the "new" keyword, or you get the same object back), which gets its own zeroed lastIndex.
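As a rough sketch of that seat-belt, a function taking a RegExp parameter could copy it before use. The function name countMatches and the sample strings are invented here, and the guard against empty matches is my own addition rather than anything discussed above.

```javascript
// Counts matches of passedRegexp in text without touching the caller's object.
function countMatches(passedRegexp, text) {
  // A private copy: same pattern and flags, its own lastIndex starting at 0.
  const own = new RegExp(passedRegexp);
  if (!own.global) {
    return own.test(text) ? 1 : 0; // a non-global pattern finds at most one match
  }
  let count = 0;
  let m;
  while ((m = own.exec(text)) !== null) {
    count++;
    if (m[0] === "") own.lastIndex++; // step past empty matches to avoid looping forever
  }
  return count;
}

// The caller's RegExp, carrying leftover state from some earlier use.
const callerPattern = /o/g;
callerPattern.lastIndex = 5;

const found = countMatches(callerPattern, "hi ho, hi ho");
console.log(found, callerPattern.lastIndex); // 2 5
```

Both occurrences are found despite the caller's stale lastIndex, and the caller's object comes back exactly as it was handed over.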
It's in good functional style never to perform destructive operations on passed parameters (unless your method is all about destructive modification, such as to populate some passed object with data), and it's very easy to forget that using a RegExp for some testing and matching is one of these destructive operations. Wear that seat-belt, especially if you ever publicize or otherwise share your code with others. You, and many others, will be glad you did, all those times where horribly weird errors didn't occur and horribly useless bug reports didn't get filed. Thanks.
But you won't ever be given treats for your consideration. Here, have a hug from me, instead. I'll love you for it, anyway, and isn't that something too? ;-)