2006-09-18

RegExp peculiarities and pitfalls

I've received my copy of JavaScript: The Definitive Guide, 5th edition now, and have been reading up on things. As expected, it's good. It's really good. It's been ages since I read much of this, and much of what I pick up this time around I think I must have missed in prior editions, perhaps because I am no longer on the same (lower) knowledge level I was on back in the nineties, when I, too, considered javascript a toy language not worth learning in depth.

Anyway, figuring my refresher might refresh or bring useful news to the attention of others, too, I thought I'd share some aha! moments of "Ooh, that's useful!" or "Ouch, that's dangerous!" (The latter probably explaining some rather weird, very difficult to find and work around bugs I've met in my day.)

First things first. The basics: the RegExp . (literal /./) doesn't match newline characters (in the generalized Unicode sense: \n, \r, \u2028 and \u2029). This wasn't much news to me, though it's something I have once in a while kludged around where speed is not of the essence, by replacing most culprits in the match string with normal spaces prior to applying the regexp, i.e. (and this too most likely misses a few cases in above-latin-1-land) mystring = mystring.replace( /[\r\n]+/g, ' ' ); -- so I can then write match patterns where . matches any character. (Mind the g flag; without it, only the first run of newlines gets replaced.)

A much better solution is of course to craft a character class which does match any character. My first thought, [^], turned out to play well with Mozilla 1.5 but pretty much nowhere else (as far as I can tell the ECMAScript grammar does permit an empty negated class, but browser support evidently doesn't follow), and my second try, [\s\S] (any whitespace or non-whitespace character), works nicely in Mozilla, Opera (9) and Internet Explorer (6) alike.
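A quick sketch of the difference, using a made-up string with a newline in the middle (the variable names are mine, just for illustration):

```javascript
// A string containing a newline between the two tags.
const text = "<b>first\nsecond</b>";

// "." stops at the line terminator, so this pattern finds nothing.
const dotMatch = text.match(/<b>(.*)<\/b>/);

// [\s\S] (whitespace or non-whitespace) matches every character,
// newlines included, so the same pattern now spans both lines.
const anyMatch = text.match(/<b>([\s\S]*)<\/b>/);

console.log(dotMatch === null);              // true
console.log(anyMatch[1] === "first\nsecond"); // true
```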

Next up, a reminder of what the "multiline" RegExp flag (/^foo$/m) does: it widens the semantics of ^ and $ to match not just start and end of string, but start and end of line too. Only that. Nothing else. (Really!)
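To make the "only that" concrete, here is a small sketch on a made-up three-line string; note that the m flag leaves the behaviour of . untouched:

```javascript
const lines = "foo\nbar\nfoo";

// Without m, ^ and $ anchor only at the start and end of the whole string,
// and the whole string is not just "foo".
const plain = lines.match(/^foo$/g);      // null

// With m, ^ and $ also anchor at the start and end of every line.
const perLine = lines.match(/^foo$/gm);   // ["foo", "foo"]

// The m flag does not make "." match newlines.
const dotStillStops = /foo.bar/m.test(lines); // false
```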

Finally, the RegExp flag "global" (/lotsaplaces/g). Here be dragons! But first, a useful feature I wasn't aware of: String.prototype.match, when fed a global-match RegExp, returns an Array with all match occurrences in the string (the full match; no match groups). I have usually been messing with loops around RegExp.prototype.exec to get hold of these before, often for very little reason, figuring mystring.match( regexp ) and regexp.exec( mystring ) did exactly the same thing. In the case of global RegExps, they don't; exec always gives you all the match groups, one match per call.
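Side by side, on a made-up string and pattern (names are mine), the two calls look roughly like this:

```javascript
const sentence = "hi ho, hi ho";
const word = /h(\w)/g;

// match() with a global RegExp: every full match, no capture groups.
const all = sentence.match(word);   // ["hi", "ho", "hi", "ho"]

// exec() with the same RegExp: one match per call, groups included.
const groups = [];
let m;
word.lastIndex = 0; // start from the beginning of the string
while ((m = word.exec(sentence)) !== null) {
  groups.push(m[1]); // the letter captured after each "h"
}
// groups is now ["i", "o", "i", "o"]
```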

And here comes the pitfall. In order to be able to match later occurrences of a match on the same string when called repeatedly, a global RegExp object's lastIndex property gets set to the next character position after the last found match (or zero, when no match was found). The next search, using that very same RegExp object, which remembers how far into the string it should start looking, will start looking there, on whatever string is fed to it. Unless you start out by zeroing lastIndex before performing your exec() (or test()) call.

This probably rarely bites people who write wasteful code that never reuses previously instantiated objects. If you do choose to keep a regexp object around, though, it's easy to miss that it carries around stateful baggage from prior uses, and needs to be handled with care. I'll be sure to guard my methods that take RegExp parameters better against these very hard to find bugs, which might remain undetected too, for the sheer obscurity of it. Debug printouts (RegExp.prototype.toSource()) of the RegExp object don't mention the state of the lastIndex property, and if your test() for occurrences of the character "h" in "hi ho, hi ho, it's off to work we go" returns false once every five calls, it's likely to go unnoticed, or, at best, yield a bug report stating that "sometimes, something doesn't work here".
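The "h" example, spelled out as a sketch. I use String() for the printout here rather than toSource(), since the latter is a Mozilla extension; the variable names are made up:

```javascript
const song = "hi ho, hi ho, it's off to work we go";
const hasH = /h/g; // global flag, object reused across calls

const results = [];
const positions = [];
for (let i = 0; i < 5; i++) {
  results.push(hasH.test(song));   // each call resumes at lastIndex
  positions.push(hasH.lastIndex);  // where the next call will start
}
// results:   [true, true, true, true, false]
// positions: [1, 4, 8, 11, 0]

// The printed form gives no hint of the hidden state.
const printed = String(hasH); // "/h/g"

// The state follows the RegExp object, not the string it was used on.
hasH.test("oh");                 // true, leaves lastIndex at 2
const stale = hasH.test("hm");   // false: the search started at index 2
hasH.test("oh");                 // true again, lastIndex back at 2
hasH.lastIndex = 0;              // zero it before the next use
const reset = hasH.test("hm");   // true
```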

A contrived example, I agree, especially given that you wouldn't haphazardly throw in a global flag to test for something like this, but to add to the confusion, these bugs tend to occur somewhere in the badlands between programmer A and programmer B, one of whom typically wrote his code in popular javascript library C, and suddenly the clash comes into play.

The cases where I believe I have run into bugs due to this easy-to-miss fringe case involve really freaky huge regular expressions (kept cached) for parsing out data, where the parsing code once in a while could choose to terminate early (before having iterated through the full match set, at which point lastIndex would have been reset to zero automatically). That leaves a non-pristine RegExp with state baggage from the last run, which then misses early matches on the next run, on some other data set.

So be sure to respect the global flag, and whenever you use the test() or exec() methods on a RegExp object Not Instantiated Here (evil twin of the Not Invented Here rule), either start off zeroing its lastIndex property, or better still (so you don't wreck state for your caller) start off making a copy of the passed RegExp for your own use -- new RegExp( passed_regexp ) (and don't forget the "new" keyword, or you get the same object back), which gets its own zeroed lastIndex.
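A minimal sketch of such a guard. The helper countMatches is hypothetical, invented here only to show the copy-first habit:

```javascript
// Hypothetical helper: counts matches without touching the caller's RegExp.
function countMatches(passedRegexp, text) {
  // new RegExp(re) builds a fresh object with the same pattern and flags,
  // and its own lastIndex starting at 0.
  const re = new RegExp(passedRegexp);
  if (!re.global) {
    return re.test(text) ? 1 : 0;
  }
  let count = 0;
  let m;
  while ((m = re.exec(text)) !== null) {
    count++;
    if (m[0] === "") re.lastIndex++; // avoid looping forever on empty matches
  }
  return count;
}

const shared = /ho/g;
shared.lastIndex = 5; // pretend the caller is in the middle of its own scan

const n = countMatches(shared, "hi ho, hi ho"); // 2
const callerState = shared.lastIndex;           // still 5

// Without "new", the very same object comes back, state and all.
const sameObject = RegExp(shared) === shared;     // true
const copied = new RegExp(shared) !== shared;     // true
```

The caller's scan position survives the call, which is the whole point of copying rather than zeroing.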

It's in good functional style never to perform destructive operations on passed parameters (unless your method is all about destructive modification, such as to populate some passed object with data), and it's very easy to forget that using a RegExp for some testing and matching is one of these destructive operations. Wear that seat-belt, especially if you ever publicize or otherwise share your code with others. You, and many others, will be glad you did, all those times where horribly weird errors didn't occur and horribly useless bug reports didn't get filed. Thanks.

But you won't ever be given treats for your consideration. Here, have a hug from me, instead. I'll love you for it, anyway, and isn't that something too? ;-)

2006-09-01

Blogger beta templates

I got curious about the Blogger Beta templates and had a peek of my own. They still merit the "Beta" tag, but a lot is already in a working or partially working state. Maybe even most of it, though I sense a still partially non-exposed aspect of their widget system, which would be a lot of fun to play with once released. Conceptually, it's a system similar to RXML or CFML; a markup language for server-side composition rendered dynamically with every page-load. Blogger's new markup language seems to be called GML (Google Markup Language, most likely), judging by the namespace markers.

It's probably similar to other blog platform widget markup languages, but I do not have that frame of reference. A few years as a developer of and technical writer on Roxen WebServer yielded me some high profile knowledge on the making of template languages, though, and there are a few pearls to pick in Blogger's new template engine.

Some concept docs are already in place about the page elements; <b:section> tags in your template mark places where widgets can be placed in the page flow and <b:widget> tags (placeable inside the former) define the part of each widget that gets rendered into the web page (server side) on the spot where the tag was encountered in the template. (Presently the pageType attribute doesn't seem to be handled as documented -- at least any I enter are swallowed, ignored and wiped away the next time I load the template.)

To me, what we see so far is about a quarter to maybe half of what constitutes this widget engine. Each widget also has a configuration view, storage backend and data object model, all tying into one another, and the rendered page as a whole has a document object model. We don't have docs about or access to either yet, but I'd love to eventually see information about this Blogger DOM. With luck that is what we will see in the detailed widget tags docs, once they show up. With even more luck, we will eventually also see and be able to make and share widgets of our own, complete with configuration views and server-side persisted data. (Here's hoping, anyway.)

This post will enter even more speculative grounds from here, as I'm just theorizing around my findings about the mechanics of this template system now -- there is little trial-and-error empirical evidence backing it, so expect flaws where my intuition was not compatible with that of the Blogger template engineers. All tags mentioned may (and mostly do) contain additional tags, unless specified otherwise. I'm addressing a programmer audience below, assuming some familiarity with variables, flow control and XML markup.

Data types

The data widgets handle comes in several varieties. First, there are the scalars -- strings, maybe (or at least conceptually) integers (perhaps just string-to-integer coercion rules), enums (pick-a-string-from-a-given-list) and booleans (the enum "true" / "false"). Second, there are the compound types; objects and collections of objects. All data coexist in a server-side object model operated on by the new Blogger tags and custom-namespace expr:* attributes that you may use with regular HTML tags too.

data: tags

Scalars are available for insertion directly into the document by way of the tags in the data:* namespace. Just name the scalar and wrap it in a tag and its value gets injected into the document, i.e. <data:blog.pageType/> would yield index for the root (or a label lookup) page, archive when on an archive page and item for a post page. The complete layout of the Blogger DOM is not addressed by this post, but keen researchers are encouraged to link back to this post from their references and show up in backlinks here, for the benefit of all readers (including myself :-), until Blogger takes its time to publish one.

expr: attributes

As XML documents don't allow tags inside tag attributes, the expr:* attribute namespace, which you can use with any HTML tag, lets you expand data there too in a similar fashion. If, say, you wanted to expand the data:blog.homepageUrl value into the href attribute of a link tag, you would write that as <a expr:href='data:blog.homepageUrl'> -- just tuck on the attribute name you would have used after the expr: prefix. If what you wanted in the attribute was a combination of a variable and something else, expr:* attributes allow a certain amount of flexibility via string concatenation, such as expr:href='data:post.url + "#comment-" + data:comment.id' (note the different types of quotation used; apostrophes to encase the full XML attribute and quotation marks to hold the string literal).
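To show how I read that concatenation rule, here is a toy model in JavaScript. It is not Blogger's implementation, which runs server side and which I haven't seen; the function evalExpr, the scope object and the sample values are all made up, and the sketch simply concatenates the pieces in order rather than parsing the + signs:

```javascript
// Toy model only: resolves data:a.b paths against a plain object and
// concatenates them with double-quoted string literals, in order.
function evalExpr(expr, scope) {
  // Pick out the pieces: either a "quoted literal" or a data:path.
  const tokens = expr.match(/"[^"]*"|data:[\w.]+/g) || [];
  return tokens.map(function (token) {
    if (token.charAt(0) === '"') {
      return token.slice(1, -1); // strip the surrounding quotation marks
    }
    // Walk the dotted path, e.g. data:post.url -> scope.post.url
    return token.slice("data:".length).split(".").reduce(function (obj, key) {
      return obj[key];
    }, scope);
  }).join("");
}

// Made-up sample data standing in for the server-side object model.
const scope = {
  post: { url: "http://example.com/2006/09/post.html" },
  comment: { id: "42" }
};

const href = evalExpr('data:post.url + "#comment-" + data:comment.id', scope);
// href: "http://example.com/2006/09/post.html#comment-42"
```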

The rest of the tags

<b:includable>

Moving on and in another level from the <b:widget> tag, we see one or more <b:includable> tags. You might compare these with the <xsl:template> rules of an XSLT template; they may invoke one another, passing along parameters (one parameter, anyway).

id (mandatory)

This is the (widget unique) name, by which the template rule is invoked from elsewhere. Every widget tag has an includable with an id attribute "main", which is what gets run when the widget is rendered into HTML form.

for (optional)

The for attribute names the incoming parameter for the variable scope visible within the includable, overshadowing data under the same name in the caller's scope. Not all includables accept parameters. In the absence of a for parameter (when a variable was passed), the name defaults to data. The type of the parameter passed to the main template is given by the type attribute of the <b:widget> tag, and its exposed object model varies accordingly.

<b:include>

This tag invokes another (or possibly the same, if recursion is supported) includable within the same widget definition.

name (mandatory)

The name parameter corresponds to the id parameter of the <b:includable> you wish to invoke.

data (optional)

The data parameter selects what data you wish to pass along to the <b:includable> you are invoking (as named in the current variable scope), if any.

<b:loop>

This construct is the iteration clause that lets you loop over a collection, applying the contained markup block once for every item in the collection.

values (mandatory)

This attribute names the collection to iterate over. For instance, you could loop over every label using data:labels for a type="Label" widget.

var (mandatory)

In the scope inside of this tag, this names the variable holding the object for the present loop iteration. It is the logical equivalent of the <b:includable> tag's for attribute. In other words, if we supply a var="comment" parameter and the objects in the collection have a body property, we can print that using <data:comment.body/>.

<b:if>

This tag is the conditional, picking one of two different outputs. In the absence of an <b:else/> branch, it either outputs its full contents or nothing at all, depending on the conditional. With an <b:else/> tag present inside, it picks the part before or afterwards depending on the result of the test.

cond (mandatory)

This attribute lists the condition to test on. It may be just a variable name (i.e. cond='data:post.allowComments', to test a boolean's truth value), or an expression, such as cond='data:blog.pageType == "item"' (again note the use of alternating types of quotation; one for the XML attribute encasing, another for the string literal). These expressions may use the operators == for testing equality and != for testing inequality.

<b:else/>

This tag may not have any content, and it may only appear inside an <b:if> tag. It splits the contents of its parent in two, the first being the branch to execute when the conditional was met, and the second when it was not.