2007-02-17

Krugle finds code

Krugle is already a seriously good contender to Google Code Search for digging up the source code of that <whatever> it is you are looking for. I love the interface, which lets you slash down to the particulars you want, if you want to, or not. With a slap-on license filter, it would beat Code Search hands down. (They seem to be considering adding one already, judging by their online survey asking whether we want it, and how badly.)

Its creators apparently just partnered up with the Yahoo! Developer Network. This is exactly the tool I wish I had had some days ago, when I dug up Dojo's getComputedStyle to hack up a greased-weasel stand-alone getStyle that should be portable across browsers.

2007-02-11

JSONP and Spreadsheets

A year has passed since I wrote about the merits of JSONP, and since then, services have slowly started using it, allowing you to use your data from web pages anywhere, in contexts outside the site of the service in question. Adoption is not very rapid, but every service that adopts it and does it well is a great win for the online community (and, I argue, for the services themselves), since it greatly increases the leverage of web programmers all over the world.

Possibly even more so than the javascript libraries and frameworks on offer; at the heart of JSONP lies a core of very lightly wrapped, highly concentrated raw data, served in the most easily digestible format possible for a javascript programmer. It's just a matter of pointing a script tag at the data, and the data comes to your code, an HTTP request and a function call later. You just name the callback and get to do whatever it is you want with the data. (Yes. It really is that simple.)
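Here is a minimal sketch of what the consumer side amounts to; the feed URL, callback name and payload shape are all made up for illustration:

// Define the callback the feed will invoke by name:
function myCallback(data) {
  // data is whatever javascript value the service wrapped for us
  alert("Got " + data.length + " rows!");
}

// Point a script tag at the JSONP feed; it calls us back once loaded:
var script = document.createElement("script");
script.src = "http://example.com/data?callback=myCallback";
document.getElementsByTagName("head")[0].appendChild(script);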

The topic of this post is spreadsheets, as spreadsheets are a very useful, low-barrier-to-entry way to access, organize and maintain structured data -- the bread and wine of web applications, widgets, aggregation services and the other tools and accessories you live and breathe on the web. Every one of them processes data of some sort, and most need to get it from somewhere. And I think online spreadsheets have lots of merit here, addressing a sore need in a very good manner.

No online spreadsheet that I am aware of supports JSONP natively today, though you can coax at least Google Spreadsheets and EditGrid into "sort of" supporting it. In the Google case you get a valid, to-the-letter correct JSONP feed which is nevertheless not JSONP in spirit, whereas with EditGrid you get as good a spiritual match as you could ask for, but no URL-configurable javascript callback name. So neither is really there yet, but I'll discuss how even this is enough to be useful, after a fashion.

Google Spreadsheets

Let's start with Google, as there is less to say there. Nowhere in their interface is there an actual JSONP (or JSON) option to pick, but you can ask their ATOM feed formatter to reformat its data as (surface) JSONP, and it will happily comply. The question to ask when devising an export format, however, is always "how would my data best be expressed for its target domain?" -- which, in the case of JSONP, means javascript. And since javascript has multidimensional arrays, which are what a spreadsheet gives a visual face to, and ATOM does not, the outcome when you reformat the ATOM feed as JSONP is, shall we say, suboptimal.

You reformat a Google Spreadsheets ATOM feed to JSONP by appending the query parameter alt=json-in-script to its URL; to make it client-configured JSONP, Google also accepts the additional parameter callback, naming your callback, which gets handed the gobbledygook that came out of the transformation. The ATOM feed is limited to single-sheet output. Given all of these constraints, here is what comes out of this example spreadsheet (reindented for humans by me; the original feed is even less legible -- not to mention the horrible URL):
/Sheet1\      /Sheet2\
A:1 B:1 C:1       b:1    
A:2 B:2 C:2   a:2 b:2 c:2
A:3 B:3 C:3   a:3     c:3
In summary, it chewed off A:1, drenched the place with huge id URLs, links, pointers to schemas and other junk invented out of nowhere, added some useful metadata about the file, and munged cells together into entities that are not separately addressable (row 2, for example, came out as "A:2" and "B:1: B:2, C:1: C:2"). Still, even given this gooey mess, fine hackers such as David Huynh manage to craft gems atop it. I again warmly recommend Exhibit. Expect more in-depth articles about it here, as I have amassed a large backlog of great hacks with it recently. It is a very hackable visualization engine indeed.
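For completeness, here is roughly what pulling in such a feed looks like -- a sketch only: the spreadsheet key is a placeholder, and the URL shape follows my reading of the GData docs for public list feeds:

function handleSheet(json) {
  // The useful bits hide in json.feed.entry -- one munged blob per row:
  alert(json.feed.entry.length + " entries");
}

var script = document.createElement("script");
script.src = "http://spreadsheets.google.com/feeds/list/SPREADSHEET_KEY/1/public/basic" +
             "?alt=json-in-script&callback=handleSheet";
document.getElementsByTagName("head")[0].appendChild(script);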

EditGrid

EditGrid does lots of things right, and in great style. They even care about making your URLs look good. Their developer interfaces receive the same amounts of love and attention as their user interfaces, which in this case is great praise for both.

The data you commit to an EditGrid spreadsheet doesn't just sit there; it bends to your will and comes out in whichever data format you care enough about to craft or look up a conversion template for -- done on the fly by applying XSLT templates server side to the XML format they export. To finish it off, all URLs are legible and short, and converting to a different format, using either the built-in defaults or your own custom formats, is a simple matter of changing the extension. Here is our example again, at http://www.editgrid.com/user/ecmanaut/example. The live URL gives anyone a spreadsheet interface to the document, and gives the author, and those he chooses to share editing privileges with, the option of editing it. Watch the other permalinks below and feel the warm fuzzies inside:

Format            Output
XML               .xml
HTML              .html
CSV               .csv
PDF               .pdf
Excel             .xls
OpenDocument      .ods
OpenOffice 1.x    .sxc
Gnumeric          .gnumeric
TeX source        .tex

I took some time to hack up a few useful variants on JSON and JSONP formats of the data for myself (and indeed others; take your pick -- and should you feel like extending them further, I recommend working off .null.cells.json, which plays best to the strengths of XSLT). It's not full-blown JSONP, of course; you can't configure the callback name from the URL the way you should be able to (which becomes useful the instant you import more than one spreadsheet), but this is as close as we get without the interactive step.
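Using one of these is then straightforward, the fixed callback name aside; a sketch, assuming (as with the built-in formats) that the custom format's extension maps straight onto the permalink:

// The XSL templates hard-wire the callback name, so we define it globally:
window.editgridCallback = function (data) {
  var first = data.workbook[0]; // first sheet: its name and rows
  alert(first.name + " has " + first.sheet.length + " rows");
};

var script = document.createElement("script");
script.src = "http://www.editgrid.com/user/ecmanaut/example.null.data.jsonp";
document.getElementsByTagName("head")[0].appendChild(script);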

I have organized the formats and the XSL templates by their properties; a few are standards-compliant JSONP, following the JSON definition, and some are even more size-conservative, employing the short form javascript uses to define sparse arrays where there are empty cells. Those that are legal and interoperable JSON(P) (not all JSON consumers are javascript!) are marked "legal JSON(P)" below; the rest, marked "javascript only", are fit for script tag consumption only.
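That short form, for reference, is the whole difference between the paired variants below:

var sparse = ["a:3", , "c:3"];   // legal javascript; sparse[1] is undefined
var strict = ["a:3", "", "c:3"]; // the JSON-legal spelling of the same row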

Plain JSON, empty cells kept (template: .xsl; output: .json; legal JSON):

{"workbook":[
{/*...*/},
{"name":"Sheet2","sheet":[
[{},{"type":"string","input":"b:1","value":"b:1","content":"b:1"},{}],
/*...*/]}
]}

Plain JSON, empty cells omitted (template: .xsl; output: .null.json; javascript only):

{"workbook":[
{/*...*/},
{"name":"Sheet2","sheet":[
[,{"type":"string","input":"b:1","value":"b:1","content":"b:1"},],
/*...*/]}
]}

JSONP, data only, empty cells kept (template: .xsl; output: .data.jsonp; legal JSONP):

editgridCallback({"workbook":[
{"name":"Sheet1","sheet":[
["A:1","B:1","C:1"],
["A:2","B:2","C:2"],
["A:3","B:3","C:3"]]},
{"name":"Sheet2","sheet":[
["","b:1",""],
["a:2","b:2","c:2"],
["a:3","","c:3"]]}
]})

JSONP, data only, empty cells omitted (template: .xsl; output: .null.data.jsonp; javascript only):

editgridCallback({"workbook":[
{"name":"Sheet1","sheet":[
["A:1","B:1","C:1"],
["A:2","B:2","C:2"],
["A:3","B:3","C:3"]]},
{"name":"Sheet2","sheet":[
[,"b:1",],
["a:2","b:2","c:2"],
["a:3",,"c:3"]]}
]})

JSONP, all cell properties, empty cells kept (template: .xsl; output: .cells.jsonp; legal JSONP):

editgridCallback({"workbook":[
{/*...*/},
{"name":"Sheet2","sheet":[
[{"type":"","input":"","value":"","content":""},/*...*/],
[{"type":"string","input":"a:2","value":"a:2","content":"a:2"},/*...*/],
[/*...*/]]}
]})

JSONP, all cell properties, empty cells omitted (template: .xsl; output: .null.cells.jsonp; javascript only):

editgridCallback({"workbook":[
{/*...*/},
{"name":"Sheet2","sheet":[
[,{"type":"string","input":"b:1","value":"b:1","content":"b:1"},],
/*...*/]}
]})

Exhibit JSONP, empty cells kept (template: .xsl; output: .exhibit.jsonp; legal JSONP):

editgridCallback({"items":[
/*...*/
{"type":"Sheet2","":"a:3","b:1":"","":"c:3"}]
})

Exhibit JSONP, empty cells omitted (template: .xsl; output: .null.exhibit.jsonp; legal JSONP):

editgridCallback({"items":[
/*...*/
{"type":"Sheet2","":"a:3","":"c:3"}]
})

I'll get back to how to use those Exhibit JSONP formats to do great things. This post is mostly about reducing my backlog of things to blog, getting them out where people can play with them and work their own magic.

2007-02-06

Full list of web pages that link yours

Delivered by whom other than Google?

If you haven't already been wowed by the latest news from Google, check out their webmaster tools: the full list of inbound links to all pages on your site, now downloadable as a CSV file listing your page, the linking page, and the last time the Google crawler found the link there.

Yep; they did it again. It's actually two priceless tools in one: the full registry of links on the web that point your way from external sites, and the full registry of links on your own site that point across it, in both cases listing both endpoints of the links.

Or browse them in their online interface, indexed on URL, listing the number of inbound links.

Update:

It is of course more fun to browse the data set from the comfort of your own local database (mysql prompt, anyone?); I whipped up a quick pike hack to import it into one, so I could run queries on the data set like
SELECT COUNT(*) AS N,site,url FROM inbound
GROUP BY site ORDER BY N DESC LIMIT 25;
SELECT COUNT(*) AS N,site,url FROM inbound WHERE path!=""
GROUP BY site ORDER BY N DESC LIMIT 25;

to get top lists of the sites linking my blog, and those linking specific content on it, respectively. And much fun was had. Set up a database (the script picks "inbound", if you don't alter its header) and feed the script your csv file. (You might want to drop the inbound table before doing a later reimport; the script hasn't evolved into anything beyond splitting up the data on useful fields yet.)
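For reference, here is roughly the shape of table those queries assume -- a sketch only: the column names are my guesses at what the pike script splits out, and Google's actual CSV column order may differ:

CREATE TABLE inbound (
  page  VARCHAR(255) NOT NULL,  -- the linked-to page on your own site
  url   VARCHAR(255) NOT NULL,  -- full URL of the external linking page
  found DATE,                   -- when the crawler last saw the link
  site  VARCHAR(255) NOT NULL,  -- host part of url
  path  VARCHAR(255) NOT NULL   -- path part of url; "" for front page links
);

-- mysql 5.0's LOAD DATA can do the field splitting itself:
LOAD DATA LOCAL INFILE 'external_links.csv' INTO TABLE inbound
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  IGNORE 1 LINES
  (page, url, found)
  SET site = SUBSTRING_INDEX(SUBSTRING_INDEX(url, '/', 3), '//', -1),
      path = SUBSTRING(url, LOCATE('/', url, 9));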

2007-02-04

(DOM)Node.textContent and Node.innerText

You can safely skip the next two paragraphs if you're in a hurry; they're just warm air.

I'm the first to admit I thrive in Mozilla/Firefox land, but I would defect in an instant if I could bring Firebug, AdBlock, the Filterset.G updater and Greasemonkey (well, its user interface and privileged API methods; the rest of Opera's user script architecture beats any other web browser's, hands down) with me, immigrating into Opera land. (Then I would spend the next few years peeking at the neighbour's green grass, wishing I could bring the best features from that environment over to the new one. But software unfortunately doesn't really work that way.)

Why? Social and emotional reasons, mostly. Having closer and better connections with the Opera dev team (that's got to be a first), and seeing how fervently they profile, chip off overhead and beat their code into a razor-sharp Japanese blade, with a staggering eye towards standards compliance and lightning speed. See? It's all a lot of emotional goo, an inflated feeling based on knowing the people who work there and more than occasionally hearing about how they spend their time.

Amusingly, I wrote up this post in a somewhat too tired state to take note of what the W3C Node.textContent DOM 3 specification actually says, believing Firefox had it wrong and current Opera did it right. It's the other way around, fortunately.

Here is how Node.textContent works in Firefox: you have a node, read its textContent property, and out you get all its text contents, stripped of all tags and comments. A <tt>Some <!-- commented-out! --> text</tt> tag would thus give you the string Some  text.

The Firefox behaviour is very useful. I love it. And fortunately it is also exactly how the W3C defined textContent to behave: comment nodes are excluded, so Some  text is the standards compliant correct result.

IE6 and IE7 do not implement Node.textContent at all, but have their own Node.innerText, which predates the DOM, and behaves the same way (barring whitespace differences -- those two space characters in the middle actually end up a single one).

Opera implements both, though at present neither is quite on target. :-) Its textContent includes comment contents, and, as MSDN does not really define very well what innerText does, Opera actually implements innerText the way the Firefox textContent works.

Firefox only implements Node.textContent, gets it right, and ended up with a very useful behaviour indeed. The fix I had thought of urging on them turns out unnecessary (no issue to file after all -- and the present behaviour has been with us for as long as I have known it, anyway); what I would still like to see is the same behaviour offered under innerText as well, as Opera does.

Safari implements neither (innerText returns the empty string, though!) but amusingly has an innerHTML property which behaves the way Firefox's textContent does. (Yay Safari! ;-)

All the above concerns the behaviour of the getters of the mentioned properties, which is the bit I have the most interest in myself. It's a great way to scrape data free of markup from pages, client side and at minimum effort, for instance in user scripts. I do this a lot.
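Mimicking that across browsers takes only a few lines. A minimal hand-rolled sketch (not from any particular library) -- it takes whatever the native getters give, quirks and all, and falls back to a comment-skipping tree walk:

function getText(node) {
  if (typeof node.textContent == "string")
    return node.textContent;          // Firefox, Opera
  if (typeof node.innerText == "string")
    return node.innerText;            // IE6/IE7 (whitespace may differ)
  // Fallback: walk the tree, collecting text nodes only (skips comments):
  var text = "";
  for (var child = node.firstChild; child; child = child.nextSibling) {
    if (child.nodeType == 3)          // text node
      text += child.nodeValue;
    else if (child.nodeType == 1)     // element node
      text += getText(child);
  }
  return text;
}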

Fortunately, I still mostly write user scripts for my personal needs, so it does not really matter to me that there is still a raging non-consensus war about the BOM (the browser object model) going on out there, even with the W3C trying to make them all agree about something. Some days, like when the W3C event bubble / trickle model was designed, for better; some days, like when browsers get textContent wrong, for worse.

Are there any ambitious people out there who have set up automated BOM test suites that run through the ins and outs of the BOM of every visitor, collecting the results and presenting a continuously updated index of the findings? I would love to chip in some money to a good project like that. And if there isn't one, here is an excellent opportunity for web fame and recognition for someone. I wouldn't mind mentoring it.

2007-02-03

Medieval Mailing List Software in the Web Age

I'm surprised that I haven't seen any widely deployed mailing list service list permalinks in their mail bodies. All the big players in this field add links to every mail on a list, to a lesser or greater extent, and they all keep a permanent record of all mails on the web. What is more, on almost all mailing lists, at least in the technical field, posters frequently have plenty of reason to refer back to prior discussions.

But none of them list permalinks. Not one! Whyever is that?

Actually, to be fair, Yahoo! Groups sort of does. They spray every message with nineteen (19) links: four are mailto: links, one is a terms-of-service link, and the other fourteen are unshaven, filthy gobs of base64-encoded data identifying you, starting at 150 characters long and moving on up to 216 (in my sample -- a number which incidentally rings familiar as the longest URL supported by some archaic mac or windows browser, if I'm not mistaken). The first of them, after an HTTP redirect (registering your click and washing away all the junk again), actually drops you off at a permalink.

Google Groups has three links: the address you mail your posts to, the address you mail to unsubscribe, and a link to the web interface for the list, where the really ambitious, helpful kind of person can, you guessed it, search for a thread to find its permalink. Points for minimum cruft, but somewhat behind the times, I think.

Jeremy Dunck recently blessed the Greasemonkey list (and indeed any Google Group) with a user script that hijacks the GMail interface to provide a shortcut search link to the thread you're reading: Gmail - Find Thread in Google Groups.

That brings Google on par with Yahoo, after some community-supplied plastic padding, for those who hang around the right bars, are in on the right buzz, and use the right tools and services.

But nowhere on the web are mailing lists with permalinks. In 2007. I find that kind of fascinating.