2006-01-07

Seventh druid, and a nifty XPath tool

Yesterday I spent a few off hours wandering the web, with an attentive eye to permalink structure and how the HTML of posts is put together. Why? Off the top of my hood, well... It seemed like more fun being a druid than a mere cleric. I am not sure it was time well spent, but it was educational, and a peek at an aspect of the web I think we seldom take much note of as site visitors. The sites I looked more closely at were (more or less) those on my OPML list of feeds I read; mostly a cross-section of JavaScript, Greasemonkey and Web 2.0 tech people, with the odd graphic designer and a few more mindful reads sprinkled on top.

Unsurprisingly, almost all of them employ modern HTML constructs for site layout, with <div> tags and semantic markup for headers, lists and the like to draw up the general structure of pages, rather than the flurry of tables, font tags and spacer images of the primeval web. I believe we have the relative maturity of CSS, and the hard work of template designers prior to the recent blogging explosion, to thank for this. (And thank deity for that -- myself, I practically left the dirty web for a few years back in the nineties, having wanted to do many things that required the much cleaner web of today, and capabilities unavailable in JavaScript and the DOM back then.)

Anyway, to my surprise, one of the sites that turned out to be a mix and match of old bad times and fresh good times was Joel on Software. This is a hand-wrought site by a programmer for (mostly) other programmers, and it employs table and tag attribute mayhem for base site structure, and occasionally classed divs for some of the content grouping inside. Not the worst tag soup design of late, but not as pretty as I had expected either.

I know, none of this is very interesting, but it gave me some good XPath exercise, trying to pick out the specific nodes of the pages I was interested in. The kind of thing like "find the last <p> child of the first <td> element which has a <div class='slug'> child" (//td[div[@class='slug']]/p[last()]), which exercises a lot more XPath than a trivial "find the <div id='viewer'> element" (//div[@id='viewer']). Granted, neither expression by itself guarantees that it matches at most one element: the first picks the last <p> child of every such <td>, not just the first one, and the second is only unique in a valid document, where ids may not repeat. For my purposes, that was not relevant.
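To see how many nodes an expression like the two above really matches on a given page, one could ask the browser for a snapshot of all matches and read off its length. The following is only a minimal sketch: the helper name countMatches is made up for illustration, and it assumes a browser document exposing the DOM Level 3 XPath evaluate method.

```javascript
// Hypothetical helper (not from the original post): counts the nodes
// that an XPath expression matches in a document.
// `doc` is assumed to be a browser Document, or anything else that
// exposes the DOM Level 3 XPath `evaluate` method.
function countMatches(expr, doc) {
  // 7 is the numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE;
  // it is spelled out here so the sketch does not depend on the global.
  var ORDERED_NODE_SNAPSHOT_TYPE = 7;
  var result = doc.evaluate(expr, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  // A snapshot result holds every matching node at once.
  return result.snapshotLength;
}
```

In a browser console, countMatches("//td[div[@class='slug']]/p[last()]", document) would then report how many <p> elements the first expression picks up on the current page.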

And, better still, it gave birth to a little scriptlet for trying out XPath expressions on a web page, flashing the first element matched (or bringing up the expression again with an error message if it found no node, or if you wrote some malformed XPath). Those of you who have used the Firefox Document Inspector will recognize the behaviour. Something like this ought to go into its "find" mode, by the way; I place this code in the public domain, should anyone want to submit it upstream.

By default, it suggests an expression matching divs with an empty class attribute, a common starting point for the kind of things I usually look for; typically I add a "post-body" or some other name between the apostrophes. I suggest typing in the expression you want and copying it to your clipboard (ctrl+C), so you can reuse it once you have seen that it does what you wanted. Go ahead and try that on this page, if you like; it should flash the text body of this post a few times.
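The scriptlet itself is not reproduced in this text, so here is a rough sketch of the behaviour described above, not the original code. All function names (findFirst, flash, tryXPath) are invented for illustration, the default expression //div[@class=''] is my reading of "divs with an empty class attribute", and the error texts and blink timing are arbitrary assumptions.

```javascript
// Sketch only; every name below is hypothetical, not the original scriptlet.
var FIRST_ORDERED_NODE_TYPE = 9; // value of XPathResult.FIRST_ORDERED_NODE_TYPE

// Evaluate `expr` against `doc`; return { node } for the first match,
// or { error } if nothing matched or the expression was malformed.
function findFirst(expr, doc) {
  try {
    var result = doc.evaluate(expr, doc, null, FIRST_ORDERED_NODE_TYPE, null);
    return result.singleNodeValue
      ? { node: result.singleNodeValue }
      : { error: 'No node matched.' };
  } catch (e) {
    // Browsers throw on expressions they cannot parse.
    return { error: 'Bad XPath: ' + e.message };
  }
}

// Blink a node by toggling its inline visibility `times` times.
// Note: this overwrites any inline visibility the node already had.
function flash(node, times) {
  if (!node.style) return; // text and attribute nodes cannot be styled
  var left = times * 2;
  var timer = setInterval(function () {
    node.style.visibility = (left % 2) ? '' : 'hidden';
    if (--left <= 0) {
      clearInterval(timer);
      node.style.visibility = ''; // always end up visible again
    }
  }, 200);
}

// Keep asking for an expression until one matches or the user cancels.
// `win` is assumed to be a browser window (it needs prompt and document).
function tryXPath(win) {
  var expr = "//div[@class='']"; // assumed default suggestion
  var message = 'XPath expression:';
  while (true) {
    expr = win.prompt(message, expr);
    if (expr === null) return null; // user pressed Cancel
    var found = findFirst(expr, win.document);
    if (found.node) {
      flash(found.node, 3);
      return found.node;
    }
    // Bring the same expression back up, with the error in the prompt text.
    message = found.error + ' Try again:';
  }
}
```

In a browser, calling tryXPath(window) would start the prompt loop. Asking for only the first ordered node mirrors the "flash the first element matched" behaviour; a version that flashes every match would need a snapshot result type instead.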

To go in the other direction, that is, to find an XPath for a particular section of an unfamiliar page (assuming you are familiar with XPath syntax and workings), I again warmly recommend Aardvark (read my prior post about it), a great extension which, once invoked, shows the node names, classes and ids of whatever the mouse hovers over. You can even walk up through the page structure by repeatedly tapping W (to widen scope). Immensely useful. And, for tough nuts such as Joel -- the Document Inspector, for looking at the full, exact node structure surrounding a node.

Having been made the seventh druid of the hoodwink society, I'm starting to feel right at home. I'll be back shortly with some additional thoughts about permalinks, which I feel ought to be given a whole article of their own. Permalinks, the remnants of the original idea of the URL, are very important technology that we ought to pay much more attention to, and teach every new generation of people coming to the web about. But I will save that discussion for later.


By the way, if you find some article or tool of mine really useful, I would very much appreciate a small tip, say a dollar, for my work on it. Try giving my donation pane a spin; I try to keep it unobtrusive and out of the way so it does not disturb the readability of my posts. Don't feel obliged to, but it would encourage me to keep doing the kind of things others (besides myself) find value in.