2006-09-18

RegExp peculiarities and pitfalls

I've received my copy of JavaScript: The Definitive Guide, 5th edition now, and have been reading up on things. As expected, it's good. It's really good. It's been ages since I read much of this, and much of what I pick up this time around I think I must have missed in prior editions, perhaps because I'm no longer on the same (lower) knowledge level as I used to be (back in the nineties, when I, too, considered javascript a toy language not worth learning in depth).

Anyway, figuring my refresher might refresh or bring useful news to the attention of others, too, I thought I'd share some aha! moments of "Ooh, that's useful!" or "Ouch, that's dangerous!" (The latter probably explaining some rather weird, very difficult to find and work around bugs I've met in my day.)

First things first. The basics: the RegExp . (literal /./) doesn't match newline characters (generalized in the Unicode sense: \n, \r, \u2028 and \u2029). This wasn't much news to me, though it's something I have once in a while kludged around where speed is not of the essence, by first replacing most culprits in the match string with a normal space before applying the regexp, i.e. (and this too most likely misses a few cases in above-latin-1-land) mystring = mystring.replace( /[\r\n]+/g, ' ' ); -- so I can then write match patterns where . matches any character. (Mind the g flag: without it, only the first run of line breaks gets replaced.)
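
A minimal sketch of that kludge, using a made-up sample string; the variable names are mine, for illustration only. It shows the dot stopping at a line break, and why the replace needs the global flag to catch every line break rather than just the first:

```javascript
// Made-up sample input with both \n and \r\n line breaks.
var text = "first line\nsecond line\r\nthird line";

// "." refuses to cross the line break, so this pattern cannot span two lines.
var spansLines = /first.*second/.test( text );

// Without the g flag, only the first run of line breaks is replaced.
var firstOnly = text.replace( /[\r\n]+/, ' ' );

// With the g flag, every run of line breaks becomes a single space.
var flattened = text.replace( /[\r\n]+/g, ' ' );

// On the flattened copy, "." can now reach all the way across.
var matchesNow = /first.*third/.test( flattened );

console.log( spansLines, matchesNow );   // false true
console.log( JSON.stringify( firstOnly ) );
console.log( JSON.stringify( flattened ) );
```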

A much better solution is of course to craft a character class which does match any character. My first thought, [^], turned out to play well with Mozilla 1.5 but pretty much nowhere else (it probably breaks the ecmascript standard, though I didn't bother taking the time to verify), and my second try, [\s\S] (any whitespace or non-whitespace character), works nicely in Mozilla, Opera (9) and Internet Explorer (6) alike.
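
For illustration, here is a small sketch of the difference, assuming a made-up snippet of markup as input (the variable names are hypothetical):

```javascript
// Made-up sample: the content between the tags spans several lines.
var html = "<p>\nhello\n</p>";

// "." cannot cross the line breaks, so there is no match at all.
var withDot = /<p>(.*)<\/p>/.exec( html );

// [\s\S] matches whitespace or non-whitespace, i.e. any character.
var withClass = /<p>([\s\S]*)<\/p>/.exec( html );

console.log( withDot );                           // null
console.log( JSON.stringify( withClass[1] ) );    // "\nhello\n"
```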

Next up, a reminder of what the "multiline" RegExp flag (/^foo$/m) does: it widens the semantics of ^ and $ to match not just start and end of string, but start and end of line too. Only that. Nothing else. (Really!)
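
A quick sketch to make that concrete, again with a made-up three-line string. The last test is there to show that the flag leaves . alone:

```javascript
// Made-up sample with three lines.
var lines = "foo\nbar\nfoo";

// Without m, ^ and $ only match at the very start and end of the string.
var plain = /^bar$/.test( lines );

// With m, they also match at the start and end of each line.
var multi = /^bar$/m.test( lines );

// Combined with g, every line that is exactly "foo" is found.
var fooLines = lines.match( /^foo$/gm );

// The m flag does not make "." match a line break.
var dotStillStops = /^foo.bar$/m.test( lines );

console.log( plain, multi, fooLines.length, dotStillStops );   // false true 2 false
```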

Finally, the RegExp flag "global" (/lotsaplaces/g). Here be dragons! But first, a useful feature I wasn't aware of: String.prototype.match, when fed a global-match RegExp, returns an Array with all match occurrences in the string (the full match only; no match groups). I have usually been messing with loops around RegExp.prototype.exec to get hold of these before, often for very little reason, figuring mystring.match( regexp ) and regexp.exec( mystring ) did exactly the same thing. In the case of global RegExps, they don't; exec always gives you all the match groups, one match per call.
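
The two behaviours side by side, as a sketch over a made-up string of key=value pairs (the names pairs, all and groups are mine):

```javascript
// Made-up sample data.
var pairs = "a=1, b=2, c=3";
var re = /(\w)=(\d)/g;

// match() with a global RegExp: every full match, no groups.
var all = pairs.match( re );

// exec() in a loop: one match per call, groups included.
var groups = [];
var m;
re.lastIndex = 0;   // start from a known state
while ( ( m = re.exec( pairs ) ) !== null ) {
  groups.push( [ m[1], m[2] ] );
}

console.log( all );      // [ 'a=1', 'b=2', 'c=3' ]
console.log( groups );   // [ [ 'a', '1' ], [ 'b', '2' ], [ 'c', '3' ] ]
```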

And here comes the pitfall. To be able to find later occurrences in the same string when called repeatedly, a global RegExp object's lastIndex property gets set to the character position just after the last found match (or zero, when no match was found). The next search using that very same RegExp object, which remembers how far into the string it should start looking, will start looking there, on whatever string is fed to it. Unless you start out by zeroing lastIndex before performing your exec() (or test()) call.
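
A minimal sketch of the baggage carrying over from one string to a completely different one (both strings are made up for the occasion):

```javascript
var re = /foo/g;

// First search: match found at index 2, so lastIndex ends up at 5.
var first = re.test( "xxfoo" );
var indexAfterFirst = re.lastIndex;

// Second search, on another string: starts at index 5 and so
// walks right past the "foo" sitting at index 0.
var second = re.test( "foo bar baz" );
var indexAfterSecond = re.lastIndex;   // a failed search resets it to 0

// Third search: back to normal, since lastIndex is zero again.
var third = re.test( "foo bar baz" );

// Zeroing lastIndex by hand gives a clean start every time.
re.lastIndex = 0;
var afterReset = re.test( "foo" );

console.log( first, indexAfterFirst, second, indexAfterSecond, third, afterReset );
// true 5 false 0 true true
```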

This probably rarely bites people who write wasteful code that never reuses previously instantiated objects. If you do choose to keep a regexp object around, though, it's easy to miss that it carries stateful baggage from prior uses, and needs to be handled with care. I'll be sure to guard my methods that take RegExp parameters better against this kind of bug, which is very hard to find and might remain undetected for the sheer obscurity of it. Debug printouts (RegExp.prototype.toSource()) of the RegExp object don't mention the state of the lastIndex property, and if your test() for occurrences of the character "h" in "hi ho, hi ho, it's off to work we go" returns false once every five calls, it's likely to go unnoticed, or, at best, yield a bug report stating that "sometimes, something doesn't work here".
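
Here is that example spelled out as a sketch. I use String() for the printout rather than toSource(), on the assumption that it is available everywhere; it is just as silent about lastIndex:

```javascript
var song = "hi ho, hi ho, it's off to work we go";
var hasH = /h/g;

// Ask the same question ten times over.
var results = [];
for ( var i = 0; i < 10; i++ ) {
  results.push( hasH.test( song ) );
}

// Four h's in the string: four hits, then one miss that resets lastIndex.
console.log( results.join( "," ) );
// true,true,true,true,false,true,true,true,true,false

// The printed form of the RegExp gives no hint of its internal state.
hasH.test( song );
var printed = String( hasH );
console.log( printed, hasH.lastIndex );   // /h/g 1
```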

A contrived example, I agree, especially given that you wouldn't haphazardly throw in a global flag to test for something like this, but to add to the confusion, these bugs tend to occur somewhere in the badlands between programmer A and programmer B, one of whom typically wrote his code in popular javascript library C, and suddenly the clash comes into play.

The cases where I believe I have run into bugs due to this easy-to-miss fringe case involve really freaky huge regular expressions (kept cached) for parsing out data, where the code once in a while could choose to terminate early (before having iterated through the full match set, at which point lastIndex would be reset to zero automatically), leaving a non-pristine RegExp with state baggage from the last run, which then misses early matches on the next run, on some other data set.
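
A scaled-down sketch of that scenario, with a tiny pattern standing in for the freaky huge one. The function firstNumberOver and its data are hypothetical, made up to show the shape of the bug, not taken from any real code of mine:

```javascript
// Stand-in for a big, cached, global regular expression.
var cached = /\d+/g;

// Hypothetical helper: returns the first number in data above limit.
function firstNumberOver( limit, data ) {
  var m;
  while ( ( m = cached.exec( data ) ) !== null ) {
    if ( Number( m[0] ) > limit ) {
      return m[0];   // early exit: lastIndex is left pointing mid-string
    }
  }
  return null;
}

// First run bails out at "20", leaving lastIndex at 4.
var run1 = firstNumberOver( 10, "5 20 7" );
var leftover = cached.lastIndex;

// Second run, on other data, starts at index 4 and never sees "99".
var run2 = firstNumberOver( 10, "99 1 50" );

console.log( run1, leftover, run2 );   // 20 4 50
```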

So be sure to respect the global flag, and whenever you use the test() or exec() methods on a RegExp object Not Instantiated Here (evil twin of the Not Invented Here rule), either start off zeroing its lastIndex property, or better still (so you don't wreck state for your caller) start off making a copy of the passed RegExp for your own use -- new RegExp( passed_regexp ) (and don't forget the "new" keyword, or you get the same object back), which gets its own zeroed lastIndex.
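
One way such a guard might look, as a sketch: findAll is a hypothetical helper written for this illustration, and the guards against non-global patterns and empty matches are my own additions to keep the loop from spinning forever.

```javascript
// Hypothetical helper: collects all matches without touching the caller's RegExp.
function findAll( passedRegexp, str ) {
  var re = new RegExp( passedRegexp );   // private copy, lastIndex starts at 0
  var found = [];
  var m;
  while ( ( m = re.exec( str ) ) !== null ) {
    found.push( m[0] );
    if ( !re.global ) break;                 // a non-global RegExp never advances
    if ( m[0] === "" ) re.lastIndex++;       // step past empty matches
  }
  return found;
}

// The caller's RegExp arrives with baggage from some earlier use.
var callers = /o/g;
callers.lastIndex = 3;

var found = findAll( callers, "foo boo" );
console.log( found );               // [ 'o', 'o', 'o', 'o' ]
console.log( callers.lastIndex );   // 3, the caller's state is untouched

// Without "new" you get the very same object back, baggage and all.
var sameObject = RegExp( callers ) === callers;
var realCopy = new RegExp( callers ) !== callers;
console.log( sameObject, realCopy );   // true true
```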

It's good functional style never to perform destructive operations on passed parameters (unless your method is all about destructive modification, such as populating some passed object with data), and it's very easy to forget that using a RegExp for some testing and matching is one of these destructive operations. Wear that seat-belt, especially if you ever publish or otherwise share your code with others. You, and many others, will be glad you did, all those times where horribly weird errors didn't occur and horribly useless bug reports didn't get filed. Thanks.

But you won't ever be given treats for your consideration. Here, have a hug from me, instead. I'll love you for it, anyway, and isn't that something too? ;-)

4 comments:

  1. Something that I've found replaces . well if you need newline matching is [^\f], which won't match the form feed character... but how often do you find that in a string?

  2. Probably only when your script was crafted first, and somebody set out to be incompatible with it on purpose due to that slight opening. (Tinfoil-hatted user scripters beware! :-)

    Seriously, though, I'd pick [\s\S] (or [\w\W], if you like) over that just to make the semantical intent of the code obvious -- a character class explicitly including every character there is means just that, whereas any non-feed, non-NULL (\0) or similar character might hint at anticipated data you're proofing your code against. Regexps are often unreadable enough without additional red herrings strewn into them.

    My suggestion might not initially make much sense either, I'll grant you, but once you've understood it once, it's idiomatic, with only one meaning, if a few characters longer.

  3. Hi, how did you manage to put a calendar on the site?

    If possible, tell me how to add one.

  4. Can you give me the code for your "Previous Entries"?
    quiaboman@hotmail.com

