2007-02-04

(DOM)Node.textContent and Node.innerText

You can safely skip the next two paragraphs if you're in a hurry; it's just warm air.

I'm the first to admit I thrive in Mozilla/Firefox land, but would defect in an instant if I could bring Firebug, AdBlock, the Filterset.G updater and Greasemonkey (well, its user interface and privileged API methods; the rest of the user script architecture in Opera beats any other web browser, hands down) with me, immigrating into Opera land. (Then I would spend the next few years peeking at the neighbour's green grass, wishing I could bring with me the best features from that environment to the next. But software unfortunately doesn't really work that way.)

Why? Social and emotional reasons mostly. Having closer and better connections with the Opera dev team (that's got to be a first), and seeing how fervently they profile, chip off overhead and beat their code into a razor-sharp Japanese blade, with a staggering eye towards standards compliance and lightning speed. See? It's all a lot of emotional goo about an inflated feeling based on knowing the people who work there and more than occasionally hearing about how they spend their time.

Amusingly, I wrote up this post in a somewhat too tired state to take note of what the W3C DOM 3 specification of Node.textContent actually said, believing Firefox had it wrong and current Opera did it right. It's the other way around, fortunately.

Here is how Node.textContent works in Firefox: you have a node, read its textContent property, and out you get all its text contents, stripped of all tags and comments. A <tt>Some <!-- commented-out! --> text</tt> tag would thus give you the string Some  text.
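To make the behaviour just described concrete, here is a minimal sketch of such a getter, run over a hand-built tree of plain objects rather than real DOM nodes so that it works outside a browser. The helper name textContentOf and the tt object literal are made up for illustration; only the nodeType numbers are the standard DOM constants.

```javascript
// Standard DOM nodeType constants.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const PROCESSING_INSTRUCTION_NODE = 7;
const COMMENT_NODE = 8;

// Hypothetical stand-in for <tt>Some <!-- commented-out! --> text</tt>.
// These are plain objects, not real DOM nodes.
const tt = {
  nodeType: ELEMENT_NODE,
  childNodes: [
    { nodeType: TEXT_NODE, nodeValue: "Some " },
    { nodeType: COMMENT_NODE, nodeValue: " commented-out! " },
    { nodeType: TEXT_NODE, nodeValue: " text" },
  ],
};

// Hypothetical helper sketching the getter: an element concatenates the text
// of its children, skipping comments and processing instructions.
function textContentOf(node) {
  if (node.nodeType === ELEMENT_NODE) {
    let text = "";
    for (const child of node.childNodes) {
      if (child.nodeType === COMMENT_NODE ||
          child.nodeType === PROCESSING_INSTRUCTION_NODE) {
        continue;
      }
      text += textContentOf(child);
    }
    return text;
  }
  // Any other node in this sketch (text, comment, PI) reports its own nodeValue.
  return node.nodeValue;
}

console.log(JSON.stringify(textContentOf(tt))); // "Some  text" (two spaces)
```

Note that the comment is only skipped when it is reached as a child of an element; asking the comment node itself still yields its contents.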

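The Firefox behaviour is very useful. I love it. When I first wrote this, I believed it was not how the W3C defined textContent to behave, and that the standards-compliant result would be Some commented-out! text. As noted above, it is the other way around: the specification excludes comments, so Firefox gets it right.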

IE6 and IE7 do not implement Node.textContent at all, but have their own Node.innerText, which predates the DOM, and behaves the same way (barring whitespace differences -- those two space characters in the middle actually end up a single one).

Opera implements both, neither actually presently quite on target. :-) As MSDN does not really define very well what innerText does, though, Opera actually implements innerText the way the Firefox textContent works.

Firefox only implements Node.textContent, gets it right, and ended up implementing a useful behaviour indeed. If Firefox eventually decides on fixing this bug (I will not urge them to hurry; indeed I don't think an issue has even been filed yet -- and the present behaviour has been with us for as long as I have known it, anyway), I really hope they would consider delegating the present behaviour to innerText, instead, as does Opera.

Safari implements neither (innerText returns the empty string, though!) but amusingly has an innerHTML property which behaves the way Firefox's textContent does. (Yay Safari! ;-)

All the above concerns the behaviour of the getters of mentioned properties, which is the bit I have most interest in myself, for the most part. It's a great way to scrape data free of markup from pages, client side and at minimum effort, for instance in user scripts. I do this a lot.
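Given the spread of support described above, a scraping script probably wants to read through whichever property the browser offers. The following is only a sketch under that assumption: readText is a hypothetical helper name, and the objects passed to it are plain stand-ins for nodes, not real DOM nodes.

```javascript
// Hypothetical helper: read a node's text through whichever property exists.
function readText(node) {
  // Use typeof rather than truthiness, so that an empty string still
  // counts as "this property is supported".
  if (typeof node.textContent === "string") return node.textContent;
  if (typeof node.innerText === "string") return node.innerText;
  return "";
}

// Plain objects standing in for nodes as different browsers might expose them.
const firefoxLike = { textContent: "Some  text" };
const ieLike = { innerText: "Some text" };

console.log(JSON.stringify(readText(firefoxLike))); // "Some  text"
console.log(JSON.stringify(readText(ieLike)));      // "Some text"
```

Reading into a return value, instead of assigning one property to the other on the node, avoids writing to the node at all. In a browser that behaves like the Safari described above, this helper would simply hand back the empty string.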

Fortunately, for myself, I still mostly write user scripts for my personal needs, so it does not really matter that there is still a ravaging non-consensus war about the BOM (the browser object model) going on out there, even with the W3C trying to make them all agree about something. Some days, like when the W3C event bubble / trickle model was designed, for better, some days, like when they got textContent wrong, for worse.

Are there any ambitious people out there who have set up automated BOM test suites running around the ins and outs of the BOM of its visitors, collecting the results and presenting a continuously updated index of their findings? I would love to chip in some money to a good project like that. And if there aren't, here is an excellent opportunity for web fame and recognition for someone. I wouldn't mind mentoring it.

5 comments:

  1. The results of textContent in Mozilla seems to be the same as doing this:

    var node = document.getElementById("myNode");
    var rng = document.createRange();
    rng.selectNode(node);
    alert(rng.toString());

  2. I think you were right that Firefox gets it wrong.

    The spec at http://www.w3.org/TR/DOM-Level-3-Core/core.html#Node3-textContent says to return the text content of this node and its descendants, for each node depending on its type. For COMMENT_NODE that is its nodeValue and that is the characters enclosed in the <!-- --> statement.

    So, what made you change your mind?

  3. That the table below that paragraph reads "concatenation of the textContent attribute value of every child node, excluding COMMENT_NODE and PROCESSING_INSTRUCTION_NODE nodes". So it is defined in a sane manner, but you won't notice it until you have read the specs with the keenest attention to detail. No wonder browser makers get it wrong on the first try.

  4. The annoying thing is the total lack of textContent at all in IE :P I've been using this:

    //get node to var node
    try { if(!node.innerText) node.innerText = node.textContent; } catch(e) {}
    //use node.innerText

    Since it's the only way of doing it I've found that most browsers are okay with and doesn't kill IE.

    Of course, one could probably run a tag-removal RegExp on node.innerHTML, but that's not very pretty...

  5. I just tried Safari 3.2 on Windows and it is working.

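For what it's worth, the tag-removal RegExp idea from comment 4 could look roughly like the sketch below. It works on a plain string such as the one innerHTML would return; stripMarkup is a hypothetical name, and this is a crude approximation rather than a real HTML parser.

```javascript
// Hypothetical helper: crude markup stripping for a serialized HTML string.
function stripMarkup(html) {
  return html
    .replace(/<!--[\s\S]*?-->/g, "") // drop comments first, including multi-line ones
    .replace(/<[^>]*>/g, "");        // then drop anything that looks like a tag
}

const sample = "<tt>Some <!-- commented-out! --> text</tt>";
console.log(JSON.stringify(stripMarkup(sample))); // "Some  text" (two spaces)
```

As the comment says, it is not very pretty: entities such as &amp;amp; are left undecoded, a ">" inside an attribute value will confuse it, and the contents of script or style elements are kept as text.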
