2006-03-05

E4X and the DOM

I haven't covered E4X (short for ECMAScript for XML; ECMA-357 specification) much here yet, but I have been experimenting with it for a while now in Greasemonkey scripts, and have got to a point where I feel I have some findings to contribute.

I'm not going to go into details about the splendour of the E4X design nor explain basic concepts; Jon Udell gave a short introduction in September 2004, and I'll be referring to a few other good articles about it later in this post too. In short, though, it makes XML nodes or trees first class objects (just like numbers, strings or RegExps), using XML as the literal syntax, and adds terse, readable and expressive syntax for various slicing and dicing operations on these objects, a syntax that shares many traits with XPath.

On a ranty side note, it's what the DOM APIs should have been in the first place, had they not been plagued by Javaisms such as naming the most basic and frequently used method document.getElementById, for a whopping 23-character name. People who write I18N and L10N for Internationalization and Localization should not use the DOM. (Or they should set themselves up with aliases such as d21d(), d27e() and so on.) John Schneider makes a less emotional comparison between XSLT, DOM and E4X in Native XML Scripting with E4X, in the proceedings of the XML 2005 conference, and sums up his conclusions in corporate speak too at the end. Again, in short: E4X is about productivity and readability.

As those of you who have been following the E4X field might know, there has been some support for it in Firefox for quite a while now. Kurt Cagle described some basics in June 2005 and returned to the subject in a later presentation, Advanced Javascript (subtitled "E4X in Firefox 1.5"), from which I'd like to quote the killer misfeature of the current state of affairs:

Object created is NOT a DOM Node, but an E4X node.

Which means that while E4X nodes are first class objects, you can't pass them to the DOM APIs; no node.appendChild( <img src={url}/> ) yet. (But had it worked, that code would have been a drop-in replacement for var img = document.createElement('img'); img.src = url; node.appendChild( img ); -- expressive indeed!) So while you can do lots of really nifty XML operations without messing with XPath through clunky DOM APIs, when it's time to inject the results anywhere you fall back on the old and ugly node.innerHTML = e4x.toXMLString(), injecting by string representation. Eww.

Or maybe not.

I sent out a plea for help to the Greasemonkey list, and some time later encountered a resourceful post by Mor Roses, where he tossed up an importNode method that translates E4X nodes to DOM nodes for a specific document object. Here is my take on it:
function importNode( e4x, doc )
{
    var me = importNode, xhtml, domTree, importMe;
    // Cache the constants and the parser on the function itself
    me.Const = me.Const || { mimeType: 'text/xml' };
    me.Static = me.Static || {};
    me.Static.parser = me.Static.parser || new DOMParser;
    // Wrap the E4X node in an XHTML-namespaced container element
    xhtml = <testing xmlns="http://www.w3.org/1999/xhtml" />;
    xhtml.test = e4x;
    // Serialize to a string and reparse it as a DOM document
    domTree = me.Static.parser.parseFromString( xhtml.toXMLString(),
                                                me.Const.mimeType );
    // Skip ahead to the first element child of the container
    importMe = domTree.documentElement.firstChild;
    while( importMe && importMe.nodeType != 1 )
        importMe = importMe.nextSibling;
    if( !doc ) doc = document;
    // Deep-copy the element into the target document
    return importMe ? doc.importNode( importMe, true ) : null;
}
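The while loop in importNode skips anything that is not an element (nodeType 1), presumably because the pretty-printed output of toXMLString() leaves whitespace text between the wrapper and its child. Here is that step isolated as a minimal sketch; firstElementFrom is a name made up for this illustration, and plain objects stand in for real DOM nodes so the snippet runs outside a browser too.

```javascript
// Walk forward from `node` through its siblings until an element node
// (nodeType 1) turns up; returns null if there is none.
function firstElementFrom(node) {
  while (node && node.nodeType !== 1) {
    node = node.nextSibling;
  }
  return node || null;
}

// Stand-in objects mimicking a whitespace text node followed by an element.
// In a browser these would be real DOM nodes.
var img = { nodeType: 1, nodeName: 'img', nextSibling: null };
var whitespace = { nodeType: 3, nodeName: '#text', nextSibling: img };

console.log(firstElementFrom(whitespace).nodeName); // img
```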
To make it more pragmatically useful, I tossed up two helper methods, appendTo and setContent. Both take an E4X structure and a target node as parameters and inject your XML at the end of the node. The latter, in addition, starts by removing any prior contents of the node:
function appendTo( e4x, node, doc )
{
    return node.appendChild( importNode( e4x, doc || node.ownerDocument ) );
}

function setContent( e4x, node )
{
    while( node.firstChild )
        node.removeChild( node.firstChild );
    appendTo( e4x, node );
}
So it's not node.appendChild( <img src={url}/> ), but appendTo( <img src={url}/>, node ). (Prototype fans may of course opt to add these methods to Node.prototype instead, laughing potential naming collisions with external libraries in the face, that aspect being an inherent feature or plague of the language design.)
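For those who do go the prototype route, one way to soften the collision risk is to install a method only when nothing of that name is there already. The following is just a sketch under that assumption: installHelpers, appendE4X and setE4X are hypothetical names invented here, and a plain object stands in for Node.prototype so the snippet runs outside a browser.

```javascript
// Copy each helper onto `proto` unless a property of that name already
// exists there (own or inherited). Returns the names actually installed.
function installHelpers(proto, helpers) {
  var installed = [];
  for (var name in helpers) {
    if (!Object.prototype.hasOwnProperty.call(helpers, name)) continue;
    if (name in proto) continue; // leave the existing property alone
    proto[name] = helpers[name];
    installed.push(name);
  }
  return installed;
}

// Stand-in for a prototype that some other library got to first.
var fakeProto = { appendE4X: function () { return 'already here'; } };

var added = installHelpers(fakeProto, {
  appendE4X: function () { return 'ours'; },
  setE4X: function () { return 'ours'; }
});

console.log(added.join(', ')); // setE4X

// In a browser, with appendTo and setContent from above in scope, the
// call might look like this (untested assumption):
// installHelpers(Node.prototype, {
//   appendE4X: function (e4x) { return appendTo(e4x, this); },
//   setE4X: function (e4x) { return setContent(e4x, this); }
// });
```

Skipping silently is itself a design choice; a script that depends on its own version of a method may prefer to throw instead.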

For a real-world code example, I'm making extensive use of this in my recent Mark my links tool (version 1.7 source code).

Greasemonkey script writers out there might want to know that it is not a perfect translation, though useful for most purposes -- the tagName property of the resulting nodes is not upper case, the way it for some reason is in HTML documents, so if you want to play under the radar of target page code looking for e.g. IMG elements, you might need to perform some additional trickery. I just kludged a case of that using unsafeWindow.Image.prototype.__defineGetter__( 'tagName', function(){ return 'IMG' } ) -- I'm sure there are nicer ways too.
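In code you control yourself, the case difference is easy to sidestep by comparing tag names case-insensitively. A minimal sketch, where hasTagName is a hypothetical helper name and plain objects stand in for DOM nodes:

```javascript
// Compare a node's tagName against `name`, ignoring case, so that both
// 'IMG' (HTML document style) and 'img' (imported XML style) match.
function hasTagName(node, name) {
  return String(node.tagName).toLowerCase() === String(name).toLowerCase();
}

// Stand-ins for an element created by an HTML page and one produced by
// importNode; real DOM nodes expose tagName the same way.
var htmlStyle = { tagName: 'IMG' };
var xmlStyle = { tagName: 'img' };

console.log(hasTagName(htmlStyle, 'img'), hasTagName(xmlStyle, 'IMG')); // true true
```

This does nothing for page code you cannot change, which is where a kludge like the getter override comes in.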

For some reason, though, I don't see any reports of parse errors in scripts where the E4X literals contain malformed XML; I just get plain non-functional scripts, which seriously hurts debugging. I have yet to find out whether it's due to some flaw of Mozilla core, Greasemonkey or my local Firefox installation. Somehow I suspect the last of these most; let's hope I'm right about that.

7 comments:

  1. wow. "misfeature" is being too kind here!

    I almost stopped reading when I discovered that you still have to "manually" parse the E4X nodes into DOM nodes...

    Your helper functions are great, readability can be maintained at the cost of dragging around libraries, but I would think that one other benefit of E4X would be to improve parsing time (instead of having to do it in JS).

    Hopefully this gets fixed sometime ...

  2. Hi Johan :)
    Sorry for the delay. I answered your request :-)

    Ah, thanks! I'll be sure to look into it; coComment integration is a neat concept.

  3. [...] I would think that one other benefit of E4X would be to improve parsing time (instead of having to do it in JS).

    Hopefully this gets fixed sometime ...


    I'm sure it will, and I agree; from what I gather, it's just not come very far in Mozilla yet, integration wise. The present state is a lot better than no E4X at all, though; it's already a very useful tool.

  4. See Appendix A of the E4X spec. It describes an optional method domNode() which converts an E4X XML object into the equivalent DOM node. And if I recall correctly the XML() constructor is supposed to be able to take a DOM node as an argument.

    I know that Brendan Eich wanted to get these features in in time for Firefox 1.5, but didn't make it. I don't know if he's continued to work on it.

  5. Development has all but ceased:
    https://bugzilla.mozilla.org/show_bug.cgi?id=270553

    Brendan apparently doesn't think it's a good idea:
    http://groups.google.com/group/mozilla.dev.tech.js-engine/browse_thread/thread/6566b430328bc3ef


    That really blows, since it'd be a marvelous thing. Really, what they could do is create a Mozilla extension type called XMLLink or some such thing. Do this:

    var x = new XMLLink(document)
    or
    var x = new XMLLink(document.node)

    then x.foo = bar

    And since x is then a pointer to document.node (or whatever) instead of a deep copy as it would be using new XML(document), it will automatically perform as it should, for instance setting the attribute foo to bar.

  6. Why not just:

    function importNode(e4x, doc) {
        var div = (doc || document).createElement('div');
        div.innerHTML = e4x.toXMLString();
        return div.removeChild(div.firstChild);
    }

    I did some testing and for small to medium blocks of html (i.e. dozens of nodes, not hundreds) it is faster to use this method. For very large blocks of html the function you posted is faster, but I think we rarely generate huge blocks on the client side.

  7. That would munge some tags (typically html, head and body). It will often work, but adds additional leakiness to the abstraction, which adds nasty overhead to debugging when that hack silently didn't do what you thought it did.

