2007-12-25

Code smaller

Ladies and gentlemen of the class of '07: code smaller!

If I could offer you only one tip for the future, slim code would be it.

The long-term benefits of small code bases have been proved by scientists, whereas the rest of my advice has no basis more reliable than my own meandering experience.

Enjoy the power and beauty of your language.

Oh, never mind. You will not understand the power and beauty of your language until you have been forced into adopting a less powerful and beautiful language.

But trust me, in 20 years, you'll look back and recall, in a way you can't grasp now, how much possibility lay before you and how fabulous that language really was.

You do not have to write software as bloated as you imagine.

Don't worry about the future.

Or do, but know that trying to design future-proof software is as effective as trying to solve an algebra equation by chewing bubble gum.

The real troubles in your code are apt to be things that never crossed your worried mind, the kind that blindside you at 4 am on some idle Tuesday.

Do one thing every day that scares you.

Think.

Don't be reckless with other people's projects.

Don't put up with people who are reckless with yours.

Refactor.

Don't waste your time on jealousy.

Sometimes you're ahead, sometimes you're behind.

The race is long and, in the end, it's only with yourself.

Remember compliments you receive. Forget the insults.

If you succeed in doing this, tell me how.

Keep your old fan mail. Throw away your old time reports.

Eat.

Don't feel guilty if you don't know what you want to do with your life.

The most interesting people I know didn't know at 22 what they wanted to do with their lives.

Some of the most interesting 40-year-olds I know still don't.

Get plenty of sleep. Be kind to your wrists. You'll miss them when they're gone.

Maybe you'll gain a following, maybe you won't.

Maybe you'll spawn sub projects, maybe you won't.

Maybe you'll leave the field at 40, maybe you'll release The Next Big Thing on your 75th wedding anniversary.

Whatever you do, don't congratulate yourself too much, or berate yourself either.

Your choices are half chance. So are everybody else's.

Enjoy your craft.

Use it every way you can. Don't be afraid of it or of what other people think of it.

It's the greatest instrument you'll ever own.

Hack, even if you have nowhere to do it but your living room.

Read the manuals, even if you don't follow them.

Do not read industry best practices. They will only make you feel inferior.


Get to know your ancestors. You never know when they'll be gone for good.

Be nice to your competitors. They're your best link to your past and the players most likely to stick with you in the future.

Understand that peers come and go, but with a precious few you should hold on.

Work hard to bridge the gaps in geography and lifestyle, because the older you get, the more you need the people who knew you when you were young.

Live in New York City once, but leave before it makes you hard.

Live in Silicon Valley once, but leave before it makes you soft. Travel.

Accept certain inalienable truths:

Codebases will grow. Heroes will change alliances. You, too, will get old.

And when you do, you'll fantasize that when you were young, code sizes were reasonable, heroes were noble, and newcomers respected their elders.

Respect your elders.

Don't expect anyone else to support you.

Maybe you have a day job. Maybe you're funded by wealthy investors. But you never know when either one might run out.

Don't turn a blind eye towards ergonomics, or by the time you're 40 you will feel 85.

Be careful whose advice you buy, but be patient with those who supply it.

Advice is a form of nostalgia. Dispensing it is a way of fishing the past from the disposal, wiping it off, painting over the ugly parts and recycling it for more than it's worth.

But trust me on the small code bases.

With thanks to Steve Yegge (via Jeff Atwood via Simon Willison), Joel Spolsky, Paul Graham et al, and apologies to Baz Luhrmann.

2007-11-13

Embrace your constraints

As a web developer, it is your job to love and leverage the straitjacket that is your confinement. This may sound wrong to you at first. Working with the grain, and not against it, likely sounds better, but it does not capture the essence as well: being a successful web developer really is about understanding and loving your imprisonment. This imprisonment should not be fought; it should be understood, and your innovation space should not extend beyond the solution space it offers, or you will produce crap. The very finest, most expensive, useless crap.

To excel is to understand and embrace the solution space offered by the properties and features of your platform, never, ever fighting to replace them with crutches in your own image. Crafting new and improved basic UI widgets from scratch, for instance, is a job for Apple, Microsoft, Opera Software, GNOME or other providers of low-level widget architectures operating on a C, C++ or Objective-C level -- the people who create the build-once, use-everywhere interface components which users learn once and use daily. Don't knock this, even if your scroll bar would be off-white, looking oh so stylish!

It would be fundamentally broken in myriad subtle ways, affecting your entire user base to varying degrees, ranging from "something unfamiliar and new to have to learn" for your novice users to "not implementing 75% of the feature set I use" for your expert users.

The expert browser user is steadily growing in numbers with today's generation of kids, who have been around the UI elements of the giants since birth. It will annoy them that they can't use the scroll wheel. Or click the scrolled pane and arrow around, page up and down, or jump to home and end via the keyboard. Or that their browser's Find feature might not scroll the found content into view. Or that your widget does not obey how they have configured their environment to behave with respect to smooth scrolling, scroll wheel yield, and the modifiers held down to change that yield, or that the key combination their keyboard layout and locale settings use for "Page Up" triggers some surprising effect that your American layout would not.

"Hey; that's unfair!", you might say; "How should I be able to know and circumvent that?"

Correct -- you shouldn't. It is not your problem, and thus you should not touch it either. All of these are things you cannot and should not try to mimic in a browser environment. Keyboard handling, for one thing, is still fundamentally flawed and unstandardized in the browser UI model; reinventing user interface workings has never been the primary focus of the web, so the provisions for doing so are sketchy and likely to be leaky abstractions. Your users are spread across a myriad of different environments, rarely if ever sharing a common computer, operating system, browser, locale or keyboard layout -- if what you think of as a keyboard even happens to be their input device of choice.

The solution space you operate in is on a level above all that. You can't, and should not, care about such details. You should blend in with how their web works, not set out to change it on this level. You may have to tell your superiors (or users!), if they ask for this, that what they want should not be addressed by your software, which lives in the wrong end of the spectrum -- that you could, at best, theoretically do something almost as good as what they already have, after draining endless amounts of resources on a wild goose chase.

You address higher level problems, and there is a mark on the abstraction chain below which you should never tread. These are the laws of innovating the evolving browser straight-jacket, from within. Breaking them is negative productivity, degrading your user experience. Work within, and your application is timeless, growing with you and the platform. Do not put up with tying your application down into today's constraints unless that is a goal in and of itself.

2007-11-01

Javascript Base64 singleton

Similar to the javascript MD5 singleton micro-lib I wrapped up some time ago (source code), here is an even tinier Base64.encode / Base64.decode singleton (source code) I reworked from Tyler Akins' public domain original. There is absolutely no error handling or recovery in this variant, so don't pass anything but eight-bit data to Base64.encode, or anything but correct base64 data to Base64.decode.
// Based on public domain code by Tyler Akins <http://rumkin.com/>
// Original code at http://rumkin.com/tools/compression/base64.php

var Base64 = (function() {
  function encode_base64(data) {
    var out = "", c1, c2, c3, e1, e2, e3, e4;
    for (var i = 0; i < data.length; ) {
      c1 = data.charCodeAt(i++);
      c2 = data.charCodeAt(i++);
      c3 = data.charCodeAt(i++);
      e1 = c1 >> 2;
      e2 = ((c1 & 3) << 4) + (c2 >> 4);
      e3 = ((c2 & 15) << 2) + (c3 >> 6);
      e4 = c3 & 63;
      if (isNaN(c2))
        e3 = e4 = 64;
      else if (isNaN(c3))
        e4 = 64;
      out += tab.charAt(e1) + tab.charAt(e2) + tab.charAt(e3) + tab.charAt(e4);
    }
    return out;
  }

  function decode_base64(data) {
    var out = "", c1, c2, c3, e1, e2, e3, e4;
    for (var i = 0; i < data.length; ) {
      e1 = tab.indexOf(data.charAt(i++));
      e2 = tab.indexOf(data.charAt(i++));
      e3 = tab.indexOf(data.charAt(i++));
      e4 = tab.indexOf(data.charAt(i++));
      c1 = (e1 << 2) + (e2 >> 4);
      c2 = ((e2 & 15) << 4) + (e3 >> 2);
      c3 = ((e3 & 3) << 6) + e4;
      out += String.fromCharCode(c1);
      if (e3 != 64)
        out += String.fromCharCode(c2);
      if (e4 != 64)
        out += String.fromCharCode(c3);
    }
    return out;
  }

  var tab = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";
  return { encode: encode_base64, decode: decode_base64 };
})();
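For illustration, a round trip through the two methods might look like this (the input string is just a made-up example):

var encoded = Base64.encode("any eight-bit data, e.g. raw MD5 output");
var decoded = Base64.decode(encoded); // back to the original string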

2007-08-22

Extending Greasemonkey by way of getters

Javascript getters and setters may be one of the most useful upcoming language features today (already working in some browsers, including Firefox), but google the feature and you will be swamped with articles and posts, not on how to put them to good use, but on how to defend against malicious use of them by web pages tripping (Firefox or Firefox extension) security holes. It's really just John Resig who stands out from the crowd, showing how it is done, suggesting good uses, and teaching the stuff.

Getters and setters are a Javascript 1.6 (IIRC) language feature that lets you provide interfaces that look and feel like DOM 0 interfaces, where you read and set a variable or property like document.title or document.links, and receive something that may or may not require computation, compositing and much else behind the scenes. The only news is that you can now implement things like the location object (set href and affect all the other properties in one go) from mere javascript and that the syntax is even rather friendly, by javascript standards:
var myLoc = (function() {
  var url; // private variable hidden in the function closure

  return {
    toString: function() {
      return this.href;
    },
    get href() {
      return url;
    },
    set href(h) {
      return url = h;
    },
    get protocol() {
      return url.match(/^.*?:/)[0];
    },
    set protocol(p) {
      url = p + url.match(/^.*:(.*)/)[1];
      return p;
    }
  };
})();
This will yield an object with an href and a protocol property you can tweak, just like any other, and see changes reflected in the other property, in what you see if you alert(myLoc), and so on. It's a dream for making APIs that don't feel like java pushed down your throat with a ten-foot pole.
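For instance (a hypothetical usage snippet, not from the original post):

myLoc.href = "http://example.com/page";
alert(myLoc.protocol);  // "http:"
myLoc.protocol = "https:";
alert(myLoc);           // "https://example.com/page", via toString()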

I've been meaning to add some features to Greasemonkey to grant it proper introspection (as all great software systems are introspective), for the Greasemonkey header block, and once I thought twice about it, for the source code at large. Something like a self object, with a "source" property for the code and a headers property for the parsed headers -- so self.headers.include would be the array of all @include headers, and so on.

I wanted that feature to give anybody the option to make their own cool extensions and libraries to Greasemonkey, making use of structured information -- much like an @xpath feature I've been noodling on on and off over the past few months to make GM scripts much more maintainable (and quick to write, and, optionally, self deprecating).

Anyway, the natural way of implementing this is via getters, and I was just about done when I hit a wall with Mozilla's extension API, which wouldn't let the script executing in the Greasemonkey sandbox read my exposed getter variable (at the time named GM_headers): it threw a "Permission denied to get property Sandbox.GM_headers" exception when code accessed GM_headers, and that was that. Is this feature crippled for extensions wanting to provide getter / setter APIs from javascript code, or am I missing some incantation to bless the sandboxed code to read it?

The (hopefully) full picture of relevant details, surrounding my implementation:
var safeWin = new XPCNativeWrapper(unsafeContentWin);
var sandbox = new Components.utils.Sandbox(safeWin);
// ...
sandbox.__proto__ = {
  get GM_headers() {
    return headers.get();
  }
};
sandbox.__proto__.__proto__ = safeWin;
This misfeature is also what stopped us from getting a responseXML property on the GM_xmlhttpRequest response object, earlier this spring. Any Mozilla hackers out there with solutions, knowledge or even just input on this? Your help and insights on this would be most valued.

2007-06-25

Users drift, time zones vary

In a time of user centricity on the web, I have still not seen any web site or service that groks that time zone is not a per-author (or sometimes even per-visitor) configuration setting, but a function of present location, which in turn is a variable entity over time. (Hint: per-item, with a preference towards suggesting last used time zone, works a lot better. Do it in style, presenting it by the widget you have that shows creation time or similar, so it reads something like "June 25, 11:36, Europe/Stockholm time", relevant fields appropriately editable.)

I entered my flight details (I'm visiting San Francisco most of the later half of July) in Google Calendar the other day, but couldn't tie down start and end points in geography, so it shows my westward flights as ridiculously short, and my first-hop home flight in later July as a huge 24-hour stretch from San Francisco to Amsterdam. There are not always 24 hours in a day when in transit, but it's a very common assumption when crafting user interfaces. Of course it's a can of worms coming up with a good visualization of what time it is throughout the day, if you're in Sweden the first few hours of the day, in London by lunch and Vancouver at the end of the day, adding hours to the day every hop on the way.

Maybe FireEagle and cool hacks like this might pave the way towards betterization on this front, for the "here and now" kind of tools, like blogs, instant messengers and other social applications.

Good OpenID integration for the presence awareness broker would of course be another huge hit. That probably needs integration work from both sides, as the time for breaking some of the assumptions ranted about above comes upon us. I personally have a feeling this forthcoming user centric wave will hit the web from the client rather than server side, though.

These days, neither clients nor users are as dumb as they used to be treated. But most server side code still is.

2007-06-13

Javascript MD5 singleton

I really love how Yahoo! are spreading the gospel on sound javascript code practices and design patterns, popularizing good ideas among the masses, the most recent example being the module pattern.

In some ways it's the pop version of Douglas Crockford's writings on javascript, in this case private members in javascript, which tells the story of how almost any higher aspect of programming in javascript is accomplished through exploiting the few properties of the function keyword and closures. You can do most kinds of programming in javascript, there is just very rarely any syntactic aid for it.

By contrast, posts like this one would be the boring narrative recount telling you to read the gospel. I'm actually going to add a tiny bit more than that, though, and give an example of how to (re)structure code (not even necessarily your own) to use the module pattern, where it did not use to. It greatly helps keeping mess out of your way.

For a one-off hack (a user script that shows bookmark permalinks at del.icio.us, so you can share them with others, even before more people than yourself have bookmarked the url), I needed a javascript MD5 implementation. Paul Johnston's one (linked) is BSD licensed and works well, but exposes lots of namespace clutter, so I wrapped it up in an MD5 object, exposing only the bits I wanted from it.

Basically, I ended up framing the code inside a var MD5 = (function(){ /* original function definitions here */ return {hash:hex_md5}; })(); block, which exposes the only method I wanted as MD5.hash( data ). No namespace pollution, readable code.

Since an MD5 component is something I often want to use to interface with the world (here it was just to interface with Del.icio.us, which uses hex encoded MD5 hashes of urls for permalinks), I decided to go on and wrap up a version exposing the full (MD5 related) feature set of the lib, though, into a tidy MD5 module.

The API is very lightly wrapped, into methods MD5.string(data), which does a raw MD5 hash on its input string, MD5.hex(data), returning the same, but hexified, MD5.base64(data), ditto but base64 encoded instead, and the same method names with an "MD5.hmac." prefix for the hmac variants. (There is an MD5.test() method for good measure too.) The same config options as the original are supported too, and with the same default values, so set MD5.b64pad = "=" for real base64.
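A sketch of what such a wrapping amounts to, assuming the function names from Paul Johnston's library (hex_md5, b64_md5, str_md5 and their hmac counterparts); the actual module linked above may differ in details:

var MD5 = (function() {
  /* Paul Johnston's original function definitions go here, unexported */
  return {
    string: str_md5,         // raw MD5 hash of the input string
    hex:    hex_md5,         // hex-encoded digest
    base64: b64_md5,         // base64-encoded digest
    hmac: {
      string: str_hmac_md5,  // ...and the HMAC-MD5 variants
      hex:    hex_hmac_md5,
      base64: b64_hmac_md5
    },
    test:   md5_vm_test      // self test
  };
})();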

I thought I'd mention this too, here: if you want a similar javascript AES/Rijndael singleton, you should peek at Josh Davis's project on javascript cryptography, which has embarked on a similar path to the above since its conception. It might still be applying a UTF-8 codec on all data that passes through it, but if you need raw AES, just drop the few utf8 related lines, and you've got one.

2007-05-31

Google Gears: next quantum leap after XMLHttpRequest

Not much of great consequence has happened to javascript, the BOM and the DOM in the years since XMLHttpRequest and DOM 1.0. The large proprietary software people (Adobe, Apple, Microsoft, Sun) have all recently tried deploying new lock-in schemes of their own targeting the web, to attract web developers into their proprietary, closed-source shackles under various flavours of EULAs and/or restrictions management, all of which will hopefully fail.

Brad Neuberg recently gave an Inventing the Future keynote speech at Yahoo FrontEnd Summit 2007, relating (among many things) that the secret to shaping the future is to be an inventive leader, in turn accomplished by combining leadership, great inventions and good values. There is much truth in that. I would personally add "working in the open" to the list.

This is exactly what Google is presently doing with Gears, which Aaron Boodman presented today (via) at Google Developer Day Sydney:



Google Gears is the next quantum leap in web development since XMLHttpRequest, addressing three of the largest issues with javascript webside development:

  • the lack of large scale (gigabyte range) client side storage,
  • offline availability of online resources, and
  • client side javascript freezing up the browser user interface due to its single-threaded design.
All this, in order to tackle the offline problem, which Brad Neuberg incidentally had already been working on in the open for some time, for the Dojo Storage and Offline modules. Gears solves it all rather beautifully with these three related but separate modules:

LocalServer
caches and serves resources (HTML, javascript, images, et c.) locally
Database
stores data locally in a fully-searchable (SQLite) relational database
WorkerPool
makes web applications more responsive by performing resource-intensive operations concurrently and asynchronously

I really recommend watching the half-an-hour presentation for the full story. Gears is already available for Firefox and Internet Explorer, soon for Safari, and, being fully new BSD licensed, allows anyone to port it to any other browser environment too. This is how you evolve the web. It is hardly coincidental that Aaron Boodman, who gave us Greasemonkey (licensed just as liberally) has been on the Gears team and gave the presentation.
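As a taste of the API, here is roughly what working with the Database module looks like, adapted from memory of the Gears developer documentation -- treat the details as approximate:

var db = google.gears.factory.create('beta.database', '1.0');
db.open('notes');  // a local, per-origin SQLite database
db.execute('CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)');
db.execute('INSERT INTO notes (body) VALUES (?)', ['remember the milk']);
var rs = db.execute('SELECT body FROM notes');
while (rs.isValidRow()) {
  alert(rs.field(0));  // or rs.fieldByName('body')
  rs.next();
}
rs.close();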

2007-05-12

SVG challenge: craft an SVG in WGS84 coordinate space

In light of the rather successful outcome of my first SVG challenge (not called such at the time), I present another:

Challenge: craft an SVG with transform/viewbox/clipping settings to map WGS84 (that's latitude/longitude, as real numbers between -90.0/+90.0 and -180.0/+180.0 respectively) coordinates to the Mercator projection (used by Google Maps et al), making the image (conceptually) a client side WMS layer or map canvas object you can draw on via javascript using latitude/longitude coordinates, on top of (for instance) Google Maps.

The outcome of such a feat would be a boom in what you can (tractably) achieve, using only client side tools, in the geomapping department, or map mashups, in broader terms. Feel free to use the Google Maps API, if it helps in the number crunching department. (I have a feeling the linear algebra could and should be pulled off with just well engineered transform matrices, though.)
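For reference, the spherical Mercator math itself is simple; here is a sketch (my own, not part of the challenge) mapping WGS84 degrees onto a unit square, which a well engineered SVG transform would then scale and translate into place:

function mercator(lat, lon) {
  var x = (lon + 180) / 360;  // 0 .. 1, west to east
  var sinLat = Math.sin(lat * Math.PI / 180);
  var y = 0.5 - Math.log((1 + sinLat) / (1 - sinLat)) / (4 * Math.PI);  // 0 .. 1, north to south
  return { x: x, y: y };
}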

2007-05-06

Useless content-*-type meta tags

I think it was in David Flanagan's Javascript: the Definitive Guide (recommended) that I picked up on the <meta name="content-script-type" content="text/javascript; charset=UTF-8"> tag, and, similarly, content-style-type, which lets you pick a default content-type for the <script> and <style> tags respectively (or so it would seem, to the naked eye).

I can't seem to find the book to verify the phrasing in JSDG, but at the time I read it, it sounded like a lovely way of doing away with the repetitive typing of type attributes for all tags of those kinds, throughout the entire document. Great!

Well, not so.

The HTML DTDs still require those type attributes (if you want your documents to validate as correct HTML), and the purpose of those meta tags is pure, senseless standards masturbation, not this convenient Don't-Repeat-Yourself way of specifying a type and character set in one place only (so my ;charset=UTF-8 appendix is particularly misleading). The intent? Well, they declare the content type of the corresponding tag attributes (onclick="...", ..., style="..."), where the HTML standard already declares sensible defaults. The relevant parts of the HTML standard linked above are very fuzzily phrased in terms of setting the default language for a document, so the misinterpretation is certainly very easy to make.

Thus it does not pair well with the pragmatic minimalist Douglas Crockford's advice of just dropping those type attributes altogether and having browsers and web servers do the right thing for you. If you want your web pages to keep working after you have saved them (and their relatively linked required resources) to disk, you'd better keep those type attributes with charset specifications, or your browser will have to guess.

Which might not have been such a bad thing, had they only guessed UTF-8 for anything that could possibly be UTF-8. Oh, well.

2007-04-29

XPath bookmarks

The web doesn't have any good way of bookmarking an arbitrary spot in a web page. With some help from the web page author, we can bookmark a specific anchor or node id in the page, but most particular spots are still not reachable for bookmarks. I just tossed up a little user script that makes any node in the page that an XPath query can address bookmarkable. It's mostly for XPath power users, for now, but works well (and lets you load bookmarks using that technique, which you might have gotten from such people).

The source code (install from here) is extremely short:
var path, node, hash = decodeURIComponent( location.hash||'' );
if( (path = /^#xpath:(.+)/.exec( hash )) &&
    (node = $X( path[1] )) )
  node.scrollIntoView();

function $X( xpath ) {
  return document.evaluate( xpath, document, null, 0, null ).iterateNext();
}
Having installed that, you can load bookmarks like http://tibet.dharmakara.net/TibetABC.html#xpath:/html/body/h2[3], and get zoomed in to the right part of the page immediately (here, the part featuring how the Tibetan numbers are spelled, what they look like, and approximately how to pronounce them, for us westerners).

Scope user scripts to HTML pages

Most user scripts, and especially user scripts writing the DOM (injecting a user interface of some sort, for instance) should, but don't, start with this line of code:
if( !(document.contentType||'').match( /html/i ) ) return;
Which means what? Well, it ascertains that the page loaded is HTML, which people tend to take for granted, but which is not the case for all pages on the web. Especially, it is not the case for text/plain pages, by the broad masses more commonly known as *.txt. In Firefox, text documents get rendered much like HTML pages, but in a <pre> encasing.

Saving the document will remove this "HTML enclosure", but if your script injected some other junk -- an interface of some sort with some text, for instance -- the saved page will include that too. This is probably not what you wanted. It is at the very least certainly not what unsuspecting users of the script wanted.

Edit: a modern GM+Firefox combination gets to run on XHTML pages. (This has not always been the case. Thanks for the correction, Rod McGuire!)

But please do decorate your scripts with the above line. It's royalty and patent free software with an irrevocable, DRM free license for all time. Public domain, at its best, working for you. Cheers!

2007-04-25

Make low-tech people publishers with EditGrid

In the real world, I regularly sing (tenor) in my local choir. Choirs have some boring administrative burdens, like keeping track of what sheet music is being sung now, and what should be returned to the choir library. In our choir, it's manual labour handled by each member and coordinated by a clerk we elect every season.

Anyway, that person is typically more neat than technical. I crafted some help tooling for her (as administrator) that gives her a minimum-maintenance publishing system, to show us, right on our internal home page, what sheets we should have and what to return. She does not need to handle messy web tech, our web page needed no server side hacks, and all she does is edit what looks and feels like her old Excel document she used to keep our sheet music in, but at EditGrid:


I added another sheet for her, stating the out/in dates (to the precision she wants) and ids of the songs we sing (for her own reference, she typically adds some additional columns: names and composers):


This EditGrid spreadsheet is open for public browsing. Firefox+Firebug users beware: unless you turn off Firebug for that domain, on visiting an EditGrid spreadsheet, your whole Firefox session (all tabs of all open windows) will freeze beyond salvation (due to issues with Firebug's XmlHttpRequest monitor, if I remember correctly). To avoid that issue, first go to the front page and right-click the Firebug icon, selecting "Disable Firebug for www.editgrid.com".

Then it's safe proceeding to the spreadsheet itself. The administrator of course has an account with edit rights to the data; you will only be able to browse it.

As mentioned in a previous post, EditGrid can export data in mostly any format you want, including JSON, if not out of the box yet (EditGrid devs: even if you don't support native JSONP yet, it would be helpful if you let users share their xsl transforms, so it gets easier to copy recipes such as this one). As I had already set up my account with a nice JSON data format exporter, I opted to reuse that (with the future option to make an Exhibit interface for the data set of what we sang when and the like, without any fuss).

With my data format (Exhibit JSON, actually), the file looks a little like this:

editgridCallback({"items":[
{"type":"Noter","löpnr":"M01","titel":"Beati sunt",...},
{"type":"Noter","löpnr":"M02","titel":"Partitur, Mariamusik",...},
...
{"type":"Ut/In","ut":"2006-","löpnr":"M63",...},
{"type":"Ut/In","ut":"2006-","titel":"Dobos",...},
{"type":"Ut/In","ut":"2006-","titel":"Östgötasången",...},
{"type":"Ut/In","ut":"2007-","titel":"Sånger från Taizé",...},
{"type":"Ut/In","ut":"2005-","titel":"Himlen & jorden sjuder",...},
{"type":"Ut/In","ut":"2006-","löpnr":"M90",...},
{"type":"Ut/In","ut":"2007-03-29","löpnr":"M09",...},
{"type":"Ut/In","in":"2007-04-19","löpnr":"M53",...},
...
]})

Then I wrote up some javascript for our web page that imports the data, listing works we should have and should return, depending on whether today's date is in the given date range or not. Notes past their due back date show up in the latter list, and notes that have not yet been handed out to us are not listed at all. There are some extra features for listing notes without any dates at all (meaning "you should have them, but I don't remember since how long back") and so on.

The result doesn't look like much, but it does the job very nicely. See the source code, liberated of blog template cruft, separately. (The bit at the end adding a timestamp to the URL is needed to prevent your browser from over-caching the data; EditGrid does not yet seem to send proper HTTP headers about content modification.)

...What we're singing now? See for yourselves (second list folded by default, to conserve space):

2007-04-13

XPath shorthands $x and $X

One good thing about work is I find it a lot easier to be rational about building (or buying) the best tools for the job. I've known for over a year that I should have the power tools I equip almost all my Greasemonkey scripts with in the Firebug console too, but never got further than requesting the feature, at some point, ages ago. Today I extended Firebug's $x(xpath) to handle $x(xpath, contextNode) too (catering to relative xpaths) and to return strings, numbers or booleans, when the expression resulted in such output.

$X is a variant on $x, which will return a node rather than an array, when the result was a node set. Instead, you get the first match, in document order. You'd be surprised how comfy and useful that is. These tools cut down the user script development feedback loop overhead for me quite noticeably. They also work even in a framed environment (when you pass a context node), which the former Firebug $x did not.
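For reference, a hypothetical sketch of what $X boils down to (not the actual patch submitted to the Firebug list):

function $X(xpath, contextNode) {
  var context = contextNode || document;
  var doc = context.ownerDocument || context;
  return doc.evaluate(xpath, context, null,
                      XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
}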

After patching up the build script a bit until it worked (so I could test it out), I submitted the patches to the Firebug list, so they might end up in the upstream 1.0.6 build as well.

Relative XPaths are very useful for answering lots of intricate questions about web pages that you'd otherwise have to work at for quite a while with the DOM inspector or Firebug's HTML view. If you, like me, like to bookmark comments you've written on some web page, you ideally want to find the closest anchor before the place where your comment showed up. Good blogs have easily clickable links around the comment to help you do that; others, like Ajaxian, don't.

With this hack, you find the closest preceding anchor by invoking Firebug's Inspector, picking out your comment, opening the Firebug command line (Command or Control + Shift + L), and typing, for instance,

node = $X('preceding::*[@name or @id][1]', $0); location.hash = '#' + (node.id||node.name);

Then it's just the matter of bookmarking as usual. Since Firebug's console features are not exposed to the page, you unfortunately can not make a bookmark out of it, but it's nevertheless a good example of what you can do with this.

2007-04-05

Trac(k)ing svn repositories

I had been dragging my feet for a while, hoping the Simile subversion repository would get a nice web based Trac timeline over commits and change sets, if I asked nicely and waited patiently. That very often helps with open source projects (Edit: this time too, eventually), but not always.

Last weekend I took to setting up a local mirror of the repository to set up my own, local, Trac, just to get that timeline/changeset browser combination. I find it indispensable for software development with more than one developer (and a very useful tool, even when you're on your own).

Thanks to SVK (a rather mature perl concoction running atop the subversion filesystem and remote access layers -- see the svk book for more info, for instance), it is actually rather comfortable to set up your local subversion mirror of a remote repository, whether your own or someone else's. Even saves you some disk compared to a common subversion repository.

Anyway, to do that I first created a vc user (for "version control", for lack of imagination) in System Preferences, as I wanted to avoid mixing up my own user's SVK depot (kept in ~/.svk) with the repository I wanted to get Trac coverage for. Running as that user, here are the steps I needed to take to get this running on my (fink enabled) macbook:
# installing the packages:
sudo fink install trac-py24
sudo fink install svk

# setting up the local repository mirror and syncing up:
svk mirror https://simile.mit.edu/repository //mirror/simile
svk sync //mirror/simile

# pointing Trac at it
mkdir ~/trac
trac-admin ~/trac/simile initenv
Then I got to answer some questions: name my Trac instance Simile, point it to the repository root /Users/vc/.svk/local, say that it is an svn repository, and get pointed to the config file ~/trac/simile/conf/trac.ini, which needed some editing.

If you want full commit messages in the timeline (I do), make sure you keep wiki_format_messages in the [changeset] section and changeset_long_messages under [timeline] both set to true. You'd think these options are orthogonal from their names, but they are not; the latter is happily ignored if you turn off wikiml. So even if, like in my case, commit messages aren't wikiml markup tied to the wiki and issue tracker inside this Trac instance, it's either pretend it is, or get truncated commit messages.
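In trac.ini terms, that boils down to something like this (section and option names as described above; the rest of the file is left untouched):

[changeset]
wiki_format_messages = true

[timeline]
changeset_long_messages = true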

Starting it is done with tracd --port 8000 ~/trac/simile (or something more permanent, by way of Apache or similar), and you can browse at http://localhost:8000/, once come this far. If you want some more linkage and less round trips between views, feel free to tuck in my Trac Timeline and Trac Changeset improver Greasemonkey scripts crafted for this particular purpose.

Does anyone know of any Trac plugins to export changeset data as JSON / JSONP? I have some plans for equipping the setup with a facet browser and some commit message searchability, and it would be a great help if I didn't have to write them myself. Even better would be getting that installed at DevjaVu, so lots of other people could benefit from the same hack. Including my own repositories. (Which I could arguably mirror, but where is the fun or elegance in that?)

2007-03-09

Flickr API exhibit

Some time ago I had a peek under the hood of the Flickr API (for digging up Flickr tags off photos, IIRC). At the time, I thought I was seeing the whole picture of the API and its mechanics, and was horrified at the hoops you had to jump through to do just about anything with it, if you consumed it from client side javascript. Sorting query attributes, md5 hashing, and having to guide a visitor through some page at the Flickr site to authenticate your application to act on some privilege level as the visitor's own Flickr user. It's safe to say that the hoop-jumping took all the pleasure away from working the API for me, and I decided I would not.

As it turns out, I had failed to notice that 48% of the API could be accessed without jumping through all those hoops. Thanks to a sudden flurry of JSONP activity from Henrik (I got to answer some question on it), I saw that he used APIs I had been jumping through hoops to get at.

It's amazing how much work it is to tell which Flickr methods are available unauthenticated. It should just be a click or two, or none at all and visible in an overview at the docs pages. I made the next best thing: a static Flickr API refdocs exhibit. So now you can. And you can browse the method call docstrings in a clickety manner without waiting for pages to load, either; all the documentation is loaded into one single place.

Just descriptions, privilege levels and so on, for now; for argument descriptions and the like, you'll still have to click the method to load the full page, but seeing as there is a neat reflection API in place, I might just make those loadable into the page, too, as another Exhibit and JSONP exercise.

That gets up to date documentation too, though I guess it might take a while to load, from doing 101 HTTP requests to pull in the data set as soon as you load the page. :)

Might be a good test about how Exhibit performs under such conditions too, though.

2007-03-08

XPath as a first-class citizen

The DOM standard packs great powers, one of them XPath, for slicing and dicing DOM trees (whether originally from XML or HTML input). Compared to the very good integration of similar javascript language features -- E4X in recent javascript implementations, or the RegExp object, with us since Javascript's earliest days -- the DOM XPath interface is a very sad story. I will refrain from telling it here, and primarily focus on how it should be redone, to better serve script authoring. My scope in this post is the web browser, so as not to confuse things with, say, ECMAScript, which is a language used in many non-DOM-centric environments too.

First, we might recognize that an XPath expression essentially is to a DOM Node tree what a RegExp is to a string. It's a coarse but valid comparison, and one that should guide how we use XPath in web programming. From here, let's thus copy good design and sketch the outlines of an XPath class that is as good and versatile for the DOM as RegExp is for strings.

Javascript RegExps are first class objects, with methods tied to them that take a data parameter to operate on. XPaths should be too. Instantiation, using the class constructor, thus looks like:
var re = new RegExp("pattern"[, "flags"]);
var xp = new XPath("pattern"[, "flags"]);
The respective patterns already have their grammars defined and are thus left without further commentary here. RegExp flags are limited to combinations of:
"g"
global match
"i"
case insensitive
"m"
multiline match
The XPath flags would map to their XPathResultType counterparts (found on the XPathResult object) for specifying what properties of the resulting node set you are interested in (if you write docs for that horrible API, please copy this enlightening table):
Nodes wanted  Behaviour  Unordered                        Ordered
Multiple      Iterator   UNORDERED_NODE_ITERATOR_TYPE=4   ORDERED_NODE_ITERATOR_TYPE=5
Multiple      Snapshot   UNORDERED_NODE_SNAPSHOT_TYPE=6   ORDERED_NODE_SNAPSHOT_TYPE=7
Single        Snapshot   ANY_UNORDERED_NODE_TYPE=8        FIRST_ORDERED_NODE_TYPE=9

These are reducible to permutations of whether you want a single item or multiple items, whether you want results sorted in document order or don't care, and whether you want a snapshot or something that just lets you iterate through the matches until you perform your first modification to a match. There are really ten options in all, but NUMBER_TYPE=1, STRING_TYPE=2 and BOOLEAN_TYPE=3 were necessitated by a design flaw we shall not repeat, and ANY_TYPE=0 is the automatic pick between one of those or UNORDERED_NODE_ITERATOR_TYPE=4. Let's have none of that.

Copying some more good design, let's make those options a learnable set of three one-letter flags, over the hostile set of ten types averaging 31.6 characters worth of typing each (or a single digit, completely devoid of semantic memorability). When designing flags, we get to pick a default flag-less mode and an override. In RegExp the least expensive case is the default, with bells and whistles invokable by flag override; we might, for instance, heed the same criteria (or, probably better still, decide on what constitutes the most useful default behaviour, and name the flags after the opposite behaviour instead):
"m"
Multiple nodes
"o"
Ordered nodes
"s"
Snapshot nodes
I briefly mentioned a design error we shouldn't repeat. The DOM document.evaluate, apart from having a long name on its own, and further drowning you in mandatory arguments and 30-to-40 character type names, does not yield results you can use right away as part of a javascript expression. Instead it hands you some ravioli, in the form of an XPathResult object, which you may pry the actual results off, by jumping through a few hoops. This is criminally offensively bad interface design, in innumerable ways. Again, let's not go there.

It might be time we decided on calling conventions, so we have some context to anchor up what the results returned are with. Our XPath object (which, contrary to a result set, makes lots of sense sticking into an object, to keep around for doing additional queries with later on without parsing the path again) has an exec() method, as does RegExp, and it takes zero to two arguments.

xp.exec( contextNode, nsResolver );

The first argument is a context node, from which the expression will resolve. If undefined or null, we resolve against document.documentElement. The context node may be anything accepted as a context node by present day document.evaluate, or an E4X object.

The second argument is, if provided as a function, a namespace resolver (of type XPathNSResolver, just as with the DOM API). If we instead provide an object, namespace prefixes are looked up by indexing into it, as with an associative array. In the interest of collateral damage control, should the user have cluttered up Object.prototype, we might be best off only picking up namespaces from it for which nsResolver.hasOwnProperty(nsprefix) yields true.

The return value from this method is similar in intended spirit to XPathResult.ANY_TYPE, but without the ravioli. XPaths yielding number, string or boolean output returns a number, string or boolean. And the rest, which return node sets, return a proper javascript Array of the nodes matched. Or, if for some reason an enhanced object be needed, one which inherits from Array, so that all (native or prototype enhanced) Array methods; shift, splice, push and friends, work on this object, too.
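A rough sketch of how such an exec() could wrap today's DOM API (flags and the literal syntax aside); purely illustrative, not part of any standard:

function XPath(pattern) {
  this.pattern = pattern;
}
XPath.prototype.exec = function (contextNode, nsResolver) {
  var context = contextNode || document.documentElement;
  var doc = context.ownerDocument || context;
  // Accept a resolver function as-is, or turn a plain object into one,
  // guarding against Object.prototype clutter via hasOwnProperty.
  var resolve = typeof nsResolver == "function" ? nsResolver :
    nsResolver ? function (prefix) {
      return nsResolver.hasOwnProperty(prefix) ? nsResolver[prefix] : null;
    } : null;
  var result = doc.evaluate(this.pattern, context, resolve,
                            XPathResult.ANY_TYPE, null);
  switch (result.resultType) {
    case XPathResult.NUMBER_TYPE:  return result.numberValue;
    case XPathResult.STRING_TYPE:  return result.stringValue;
    case XPathResult.BOOLEAN_TYPE: return result.booleanValue;
    default:  // node sets come back as a plain javascript Array
      var nodes = [], node;
      while ((node = result.iterateNext())) nodes.push(node);
      return nodes;
  }
};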

Finally, RegExps enjoy a nice, terse, literal syntax. I would argue that XPaths should, as well. My best proposal (as most of US-ASCII is already allocated) is to piggy-back onto the RegExp literal syntax, but mandate the flag "x" to signify being an XPath expression. Further on, as / is a too common character in XPath to even for a moment consider having to quote it, make the contents of the /.../ containment a '-encased string, following all the common string quoting conventions.

var links_xp = /'.//a[@href]'/xmos;
var posts_xp = /'//div[@class="post"]'/xmos;


for instance, for slicing up a local document.links variant for some part of your DOM tree, and for picking up the root nodes of all posts on this blog page respectively. And it probably already shows that the better default is to make those flags the default behaviour, and name the inverse set instead. Perhaps these?
"s"
Single node only
"u"
Unordered nodes
"i"
Iterable nodes

When requesting a single node, you get either the node that matched, or null. Illegal combinations of flags and XPath expressions ideally yield compile time errors. Being able to instantiate new XPath objects off old ones, given new flags, would be another welcome feature. There probably are additional improvements and clarifications to make. Good ideas shape the future.

2007-03-07

Firefox content type bugfix extension

The not much heard of Open in browser extension is a great improvement over the Firefox deficiency of disallowing user overrides to the Content-Type header passed by web servers. The Content-Type header, if unfamiliar, is what states the data type of files you download, so that your browser can pick a suitable mode of decoding (if it is a JPEG picture, use the native JPEG decoder, for instance, while showing text files as text) and presenting the data.

Web server admins are people too; they make mistakes, or occasionally have weird ideas contrary to yours about how to present a file (most frequently, prompting with a Save As... dialog for plain text, HTML, or images instead of showing them right in the browser), and a stock Firefox lets them rule you. This extension grants you the option of choice.

It presently (v1.1) does not allow you to specify any legal content type override, but it handles the basic xxx/yyy types, while considering "text/plain; charset=UTF-8", for instance, illegal. This seems to be a common misconception about Content-Types (or MIME types, as they are also commonly called), which I would like to see fade away. If you are interested in references, §14.17 of RFC 2616 (HTTP) states that the leading type/subtype declaration may be followed by any number of {semicolon, attribute=value pair} blocks, so if you are tempted to do validation on legal content type declarations, for some reason, don't disallow those.

Excerpt of the relevant ABNF, if you want to generate a proper grammar validator:

media-type     = type "/" subtype *( ";" parameter )
type = token
subtype = token
parameter = attribute "=" value

attribute = token
value = token | quoted-string
token = 1*<any CHAR except CTLs or separators>

CHAR = <any US-ASCII character (octets 0 - 127)>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT

CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>

quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext = <any TEXT except <">>
<"> = <US-ASCII double-quote mark (34)>

TEXT = <any OCTET except CTLs,
but including LWS>
OCTET = <any 8-bit sequence of data>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
LWS = [CRLF] 1*( SP | HT )

CRLF = CR LF
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>

quoted-pair = "\" CHAR


If not, and you, say, want to do it with a Javascript regexp instead, here is a free regexp to choke on, if you really do want to do strict content type validity checking by (javascript) regexp, rather than, say, check for validity using a more lax variant, perhaps /.+\/.+/:

var validMIME = /^[^\0- "(),\/:-@\[-\]{}\x80-\xFF]+\/[^\0- "(),\/:-@\[-\]{}\x80-\xFF]+(;[^\0- "(),\/:-@\[-\]{}\x80-\xFF]+=([^\0- "(),\/:-@\[-\]{}\x80-\xFF]+|"(\\[\0-\x7F]|[^\\"\0-\x08\x0B\x0C\x0E-\x1F])*?"))*$/;
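For instance (hypothetical inputs; note that the regexp, as written, does not admit whitespace after the semicolon):

validMIME.test('text/plain;charset=UTF-8');  // true
validMIME.test('image/jpeg');                // true
validMIME.test('just some text');            // false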

As you see, regexps are really horrible tools for doing this kind of thing, but with a bit of pain they can do the work. I'd suggest keeping a link to this page in a short comment in your code, should you adopt that monster, in case you ever have any issues with it, or need to work out why it bugs out. Chances are your IDE does not let you mark up token semantics the way I did above.

2007-03-02

Google Pages command-line upload

Google Pages offers you 100 megabytes of free web storage, where you can put html, images, text, javascript, music, video and pretty much whatever you like. You also get five free sub-domains of your choice to googlepages.com. That's on the plus side.

(I have been toying with Exhibit showcase hacks there, gathering up my Exhibit hacks and mashups as I write them.)

On the minus side, you can presently only drop files in a single directory level, files are typically not served at greased-weasel speed and latency and you have to use either a Firefox or Internet Explorer browser to post them, and using an ajaxy form at that -- no sftp, ftp, webdav or HTTP PUT access in sight. (I also believe I've read about a top number of files per site in the 500 range.)

Anyway, I tried to craft my first shaky ruby hack last week, to get a command line client which would let me upload files in batch. Unfortunately I failed rather miserably at navigating the Google login forms (should someone want to point out what I did wrong, or rather what I ought to do instead, the broken code is here; a good-for-nothing dozen-liner of WWW::Mechanize).

So I resorted to the classic semi-automated way: logging in by browser, stealing my own cookies and doing the rest of the work with curl. It works, and is less painful than tossing up fifty-something files by mayhem-clicking an ajax form upload, however comparatively nice they have made it, as web forms go. This recipe requires a working knowledge of your shell, an installed curl, being logged in to Google Pages, and having chosen the appropriate site.

Then invoke the cookie stealer bookmarklet and copy the value to your clipboard. I suggest you head over to your shell right away, type export googlecookie='' and paste your cookie between the single quotes.

Head back to your browser window, to invoke the auth token post url stealer bookmarklet, which picks up the form target url. Copy it to your clipboard, head back to the shell and type export googletarget='' (again pasting the value between the single quotes). Now you're set.

To upload a file now, all you need to do is run curl -F file=@$filename --cookie $googlecookie $googletarget and it gets dropped off as it should. And zooming up a whole junkyard of files is no more painful:

zsh> for i in *.png; curl -F file=@$i --cookie $googlecookie $googletarget

It's not pretty, but it is some pain relief, if you're hellbent on playing with free services rather than getting dedicated hosting. It's also a "because it's possible" solution -- and those are fun to craft, once in a while. I'd love to find out what I didn't figure out about taming the Google login form via Ruby, or vice versa. Your competent input most welcome.

2007-02-17

Krugle finds code

Krugle is an already seriously good contender to Google Code Search, for digging up the source code of that <whatever> it is you are looking for. I love the interface, which lets you slash down to the particulars you want, if you want to, or not. With a slap-on license filter, it would beat Code Search hands down. (They seem to consider adding it already, judging by their online survey asking whether we want it or not, and how badly.)

Its creators apparently just partnered up with the Yahoo! Developer Network. This is exactly the tool I wish I had some day ago, when I dug up Dojo's getComputedStyle, to hack up a greased-weasel stand-alone getStyle which should be portable across browsers.

2007-02-11

JSONP and Spreadsheets

A year has passed since I wrote about the merits of JSONP, and since then, services have slowly started using it, allowing you to use your data from web pages anywhere in contexts outside of the site of the services in question. Adoption is not very rapid, but every service that adopts it and does it well is a great win for the online community (and, I argue, the services themselves), since they are increasing the leverage of web programmers all over the world to a great extent.

Possibly even more so than javascript libraries and frameworks on offer; at the heart of JSONP lies a core of very lightly wrapped, highly concentrated raw data served in the most easily digestible format possible for a javascript programmer. It's just a matter of pointing a script tag at the data, and the data comes to your code, an HTTP request and a function call later. You just name the callback and get to do whatever it is you want to do with the data. (Yes. It really is that simple.)
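In code, that really is all there is to it -- a hypothetical sketch (the service URL and callback name are made up):

// The callback we name in the URL; the service wraps its data in a call to it.
function myCallback(data) {
  // ...do whatever you want with the data here...
}
// Point a script tag at the data; one HTTP request later, myCallback runs.
var script = document.createElement('script');
script.src = 'http://example.com/data.jsonp?callback=myCallback';
document.getElementsByTagName('head')[0].appendChild(script);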

The topic of this post will be spreadsheets, as a spreadsheet is a very useful, low-barrier-to-entry way to access and maintain structured data, the bread and wine of web applications, widgets, aggregation services and other tools and accessories you live and breathe on the web. Every one of them processes data of some sort, and most need to get it from somewhere. And I think online spreadsheets have lots of merit here, addressing a sore need in a very good manner.

No online spreadsheet that I am aware of supports JSONP natively today, though you can coax at least Google Spreadsheets and EditGrid into "sort of" supporting it. In the Google case you get a valid, to-the-letter correct JSONP feed, which is however not JSONP in spirit, whereas with EditGrid you get as good a spiritual match as you care for, but no URL-configurable javascript callback name. So neither is really there yet, but I'll discuss how even this is already enough to be useful, after a fashion.

Google Spreadsheets

Let's start with Google, as there is less to say about that. Nowhere in their interface is there an actual JSONP (or JSON) option to pick, but you can ask its ATOM feed formatter to reformat its data as (surface) JSONP and it will happily comply. The question, however, when devising an export format, is always "how would my data best be expressed for its target domain?", which, in the case of JSONP, is javascript. And since javascript has multidimensional arrays, which is what a spreadsheet gives a visual face to, and ATOM does not, the outcome when you reformat the ATOM feed as JSONP is, shall we say, suboptimal.

You reformat a Google Spreadsheets ATOM feed to JSONP by appending the URL query parameter ?alt=json-in-script at the end of its URL and, to make it client-configured interactive JSONP, Google accepts the additional parameter callback, naming our callback, and supplies it with the gobbledygook that came out of the transformation. The ATOM feed is limited to single sheet output, and given all of these constraints, here is what comes out of this example spreadsheet (reindentation of the feed for humans by me; the original is even less legible -- not to mention the horrible URL):
/Sheet1\      /Sheet2
A:1 B:1 C:1       b:1    
A:2 B:2 C:2   a:2 b:2 c:2
A:3 B:3 C:3   a:3     c:3
In summary, it chewed off A:1, drenched the place with huge id URLs, links, pointers to schemas and other junk invented out of nowhere, added some useful metadata about the file and munged together cells into not separately addressable entities (an example of which being row 2, which came out as "A:2" and "B:1: B:2, C:1: C:2"). Still, even given this gooey mess of things, fine hackers such as David Huynh manage to craft gems atop it. I again warmly recommend Exhibit. Expect upcoming and more in-depth articles about it here, as I have amassed a large backlog of great hacks with it recently. It is a very hackable visualization engine indeed.

EditGrid

EditGrid does lots of things right, and in great style. They even care about making your URLs look good. Their developer interfaces get the same amounts of love and attention as their user interfaces do, which in this case is great praise for both.

The data you commit to an EditGrid spreadsheet doesn't just sit there; it bends to your will and comes out in whichever data format you care enough about to craft or look up a conversion template for -- which is done on the fly by applying XSLT templates server side to the XML format they export. To finish it off, all URLs are legible and short, and converting to a different format, using either the built-in defaults or your own custom formats, is a simple matter of changing the extension. Here is our example, again, at http://www.editgrid.com/user/ecmanaut/example. The live URL allows anyone a spreadsheet interface to the document, and gives the author, and those he chooses to share editing privileges with, the option of editing it. Watch the other permalinks below and feel the warm fuzzies inside:

Format           Output
XML              .xml
HTML             .html
CSV              .csv
PDF              .pdf
Excel            .xls
OpenDocument     .ods
OpenOffice 1.x   .sxc
Gnumeric         .gnumeric
TeX source       .tex

I took some time to hack up a few useful variants on JSON and JSONP formats of the data for myself (and indeed others; take your picks -- and should you feel like extending them further, I recommend working off the last one, which plays best on the strengths of XSLT, .null.cells.json). It's not full-blown JSONP, of course; you can't configure the callback name from the URL the way you should be able to (which becomes useful the instant you import more than one spreadsheet), but this is as close as we get without the interactive step.

I have organized the formats and the XSL templates by their properties; a few are standards compliant JSONP, following the JSON definition, some are even more size conservative and employ the short form used by javascript to define sparse arrays where there are empty cells. Those that are legal and interoperable JSON(P) (not all JSON consumers are javascript!) are tinted greenish in the table, the rest slightly rougeishly blushing.

Plain JSON, empty cells become "" (.xsl, .json):
{"workbook":[
{/*...*/},
{"name":"Sheet2","sheet":[
[{},{"type":"string","input":"b:1","value":"b:1","content":"b:1"},{}],
/*...*/]}
]}

Plain JSON, empty cells omitted (.xsl, .null.json):
{"workbook":[
{/*...*/},
{"name":"Sheet2","sheet":[
[,{"type":"string","input":"b:1","value":"b:1","content":"b:1"},],
/*...*/]}
]}

JSONP, data only, empty cells become "" (.xsl, .data.jsonp):
editgridCallback({"workbook":[
{"name":"Sheet1","sheet":[
["A:1","B:1","C:1"],
["A:2","B:2","C:2"],
["A:3","B:3","C:3"]]},
{"name":"Sheet2","sheet":[
["","b:1",""],
["a:2","b:2","c:2"],
["a:3","","c:3"]]}
]})

JSONP, data only, empty cells omitted (.xsl, .null.data.jsonp):
editgridCallback({"workbook":[
{"name":"Sheet1","sheet":[
["A:1","B:1","C:1"],
["A:2","B:2","C:2"],
["A:3","B:3","C:3"]]},
{"name":"Sheet2","sheet":[
[,"b:1",],
["a:2","b:2","c:2"],
["a:3",,"c:3"]]}
]})

JSONP, all properties, empty cells become "" (.xsl, .cells.jsonp):
editgridCallback({"workbook":[
{/*...*/},
{"name":"Sheet2","sheet":[
[{"type":"","input":"","value":"","content":""},/*...*/],
[{"type":"string","input":"a:2","value":"a:2","content":"a:2"},/*...*/],
[/*...*/]]}
]})

JSONP, all properties, empty cells omitted (.xsl, .null.cells.jsonp):
editgridCallback({"workbook":[
{/*...*/},
{"name":"Sheet2","sheet":[
[,{"type":"string","input":"b:1","value":"b:1","content":"b:1"},],
/*...*/]}
]})

Exhibit JSONP, empty cells become "" (.xsl, .exhibit.jsonp):
editgridCallback({"items":[
/*...*/
{"type":"Sheet2","":"a:3","b:1":"","":"c:3"}]
})

Exhibit JSONP, empty cells omitted (.xsl, .null.exhibit.jsonp):
editgridCallback({"items":[
/*...*/
{"type":"Sheet2","":"a:3","":"c:3"}]
})
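To give a feel for consuming one of these, here is a minimal sketch of my own. The callback name editgridCallback is the one the templates above produce; the exact URL for a custom format is an assumption on my part, following the same change-the-extension pattern as the built-in formats:

// Define the callback before the data arrives; EditGrid's JSONP always
// calls editgridCallback, since the name can't be configured in the URL.
window.editgridCallback = function (data) {
  var sheet1 = data.workbook[0];    // {"name":"Sheet1","sheet":[...]}
  var cellA2 = sheet1.sheet[1][0];  // second row, first column: "A:2"
  alert(sheet1.name + ", row 2, column 1: " + cellA2);
};

var script = document.createElement("script");
script.src = "http://www.editgrid.com/user/ecmanaut/example.data.jsonp";
document.getElementsByTagName("head")[0].appendChild(script);

Since the callback name is fixed, importing several spreadsheets on one page takes a bit more bookkeeping -- exactly the limitation mentioned above.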

I'll get back to how to use those Exhibit JSONP formats to do great things. This post is mostly about shrinking my backlog of things to blog, getting them out where people can play with them and work their own magic.

2007-02-06

Full list of web pages that link yours

Delivered by whom other than Google?

If you haven't already been wowed by the latest news from Google, check out their webmaster tools: the full list of inbound links to all pages on your site, now downloadable as a CSV file listing your page, the linking page, and the last time the Google crawler found the link there.

Yep; they did it again. It's actually two priceless tools in one: the full registry of links on the web that point your way from external sites, and the full registry of links on your own site that point across it, in both cases listing both endpoints of the links.

Or browse them in their online interface, indexed on URL, listing the number of inbound links.

Update:

It is of course more fun to browse the data set from the comfort of your own local database (a mysql prompt, say); I whipped up a quick Pike hack to import it into one, so I could run queries on the data set like
SELECT COUNT(*) AS N,site,url FROM inbound
GROUP BY site ORDER BY N DESC LIMIT 25;
SELECT COUNT(*) AS N,site,url FROM inbound WHERE path!=""
GROUP BY site ORDER BY N DESC LIMIT 25;

to get top lists of sites linking to my blog, and to specific content on it, respectively. And much fun was had. Set up a database (the script picks "inbound", if you don't alter its header) and feed the script your CSV file. (You might want to drop the inbound table before a later reimport; the script hasn't evolved beyond splitting up the data into useful fields yet.)

2007-02-04

(DOM)Node.textContent and Node.innerText

You can safely skip the next two paragraphs if you're in a hurry; it's just warm air.

I'm the first to admit I thrive in Mozilla/Firefox land, but I would defect in an instant if I could bring Firebug, AdBlock, the Filterset.G updater and Greasemonkey (well, its user interface and privileged API methods; the rest of Opera's user script architecture beats any other web browser, hands down) with me and emigrate to Opera land. (Then I would spend the next few years peeking at the neighbour's green grass, wishing I could bring the best features of each environment along to the other. But software unfortunately doesn't really work that way.)

Why? Social and emotional reasons, mostly. Having closer and better connections with the Opera dev team (that's got to be a first), and seeing how fervently they profile, chip off overhead and beat their code into a razor-sharp Japanese blade, with a staggering eye towards standards compliance and lightning speed. See? It's all a lot of emotional goo, an inflated feeling based on knowing the people who work there and more than occasionally hearing about how they spend their time.

Amusingly, I wrote up this post in a somewhat too tired state to take note of what the W3C Node.textContent DOM 3 specification actually says, believing Firefox had it wrong and that current Opera did it right. It's the other way around, fortunately.

Here is how Node.textContent works in Firefox: you have a node, you read its textContent property, and out you get all its text contents, stripped of all tags and comments. An element containing the markup Some <!-- commented-out! --> text would thus give you the string "Some  text".

The Firefox behaviour is very useful. I love it. And, contrary to what I first wrote here, it is also how the W3C defined textContent to behave; had comment nodes been included, the result would instead have been "Some commented-out! text".

IE6 and IE7 do not implement Node.textContent at all, but have their own Node.innerText, which predates the DOM and behaves the same way as Firefox's textContent (barring whitespace differences -- the two space characters in the middle actually end up as a single one).

Opera implements both, neither at present actually quite on target. :-) As MSDN does not really define very well what innerText does, though, Opera actually implements innerText the way Firefox's textContent works.

Firefox only implements Node.textContent, gets it right (not wrong, as I first believed), and so ends up implementing a useful behaviour indeed. Should Firefox ever decide to change this (I will not urge them to hurry; I don't think an issue has even been filed yet -- and the present behaviour has been with us for as long as I have known it, anyway), I really hope they would consider delegating the present behaviour to innerText, as Opera does.

Safari implements neither (innerText returns the empty string, though!) but amusingly has an innerHTML property which behaves the way Firefox's textContent does. (Yay Safari! ;-)

All the above concerns the behaviour of the getters of the mentioned properties, which is the part I am most interested in myself. It's a great way to scrape data, free of markup, from pages client side and at minimum effort, for instance in user scripts. I do this a lot.
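For instance, a user script might carry a tiny helper along these lines; a minimal sketch of my own, where the fallback order and the hand-rolled tree walk are merely one reasonable way of papering over the differences described above:

// Return the text of a node, markup-free, across 2007-era browsers.
function textOf(node) {
  if (typeof node.textContent == "string")
    return node.textContent;              // Firefox, Opera
  if (typeof node.innerText == "string")
    return node.innerText;                // IE6 / IE7
  // Last resort: walk the tree ourselves, picking up text nodes only.
  var text = "";
  for (var child = node.firstChild; child; child = child.nextSibling)
    text += child.nodeType == 3 ? child.nodeValue : textOf(child);
  return text;
}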

Fortunately for me, I still mostly write user scripts for my personal needs, so it does not really matter that there is still a ravaging non-consensus war about the BOM (the browser object model) going on out there, even with the W3C trying to make them all agree about something. Some days for better, like when the W3C event bubble / trickle model was designed; some days for worse, like when I believed they had got textContent wrong.

Are there any ambitious people out there who have set up automated BOM test suites, exercising the ins and outs of the BOM of their visitors' browsers, collecting the results and presenting a continuously updated index of the findings? I would love to chip in some money to a good project like that. And if there isn't one, here is an excellent opportunity for web fame and recognition for someone. I wouldn't mind mentoring it.

2007-02-03

Medieval Mailing List Software in the Web Age

I'm surprised that I haven't seen any widely deployed mailing list service list permalinks in its mail bodies. All the big players in this field add links to all mail on the list, to a lesser or greater extent, and they all keep a permanent record of all mails on the web. What is more, on almost all mailing lists, at least in the technical field, posters frequently have plenty of reason to refer to prior discussions.

But none of them list permalinks. Not one! Whyever is that?

Actually, to be fair, Yahoo! Groups sort of does. They spray every message with nineteen (19) links: four are mailto: links, one is the terms of service, and the other fourteen are unshaven, filthy gobs of base64-encoded data identifying you, starting at 150 characters long and moving on up to 216 in my sample (a number which incidentally rings familiar as the longest URL length supported by some archaic Mac or Windows browser, if I'm not mistaken). The first of those fourteen, after an HTTP redirect (registering your click and washing away all the junk again), actually drops you off at a permalink.

Google Groups has three links: the address you mail your posts to, the address you mail to unsubscribe, and a link to the web interface for the list, where the really ambitious, helpful kind of person can, you guessed it, search for a thread to find its permalink. Points for minimum cruft, but somewhat behind the times, I think.

Jeremy Dunck recently blessed the Greasemonkey list (and indeed any Google Group) with a user script hijacking the GMail interface to provide a shortcut search link to the thread you're reading, in Gmail - Find Thread in Google Groups.

That brings Google roughly on par with Yahoo, after some community-supplied plastic padding, for those who hang around the right bars, are in on the right buzz, and use the right tools and services.

But nowhere on the web are there mailing lists with permalinks in the mails themselves. In 2007. I find that kind of fascinating.

2007-01-24

Great Free Services and Tools

DevjaVu: Free subversion and Trac hosting for your code. A great, warm attitude, from the start of the terms of service to their mode of implementation. You want something like Trac, especially for its timeline of recent commits, in my experience. The issue tracker, with its automated ties to checkin messages ("Closes #4", for instance), is another great feature to have. Spoil yourself!

Userscripts.org has been using this for some time. It's rather easy to check out the code base to contribute, by the way; it wasn't many minutes before I had a test us.o server of my own running.

Google Code: Another free subversion repository host for your code, from a recognized brand. A very nice and visible way of declaring repository content licenses -- how kindly you allow others to use your work, for one thing. I have started migrating my growing collection of user scripts there, starting with the most recent additions. I am a sucker for keeping a tidy repository, so I have postponed checking in my old junk until I've taken inventory of what is still useful. Committing things somewhere when you reach the next good state is a rewarding feeling, and recording a message describing the change is simply priceless.

Greasemonkey is tentatively moving its code base and wiki here. Voices on the GM list have it that the repository is a bit slow at times. I have not suffered much from it myself yet, but with my choice of repository, I would probably not notice even rather horrible performance.

Subtlety: Regardless of your pick of subversion repository, letting others keep up to date with what you do in it, by way of a feed, is a snap with this beautiful Camping tool. Feed it a subversion repository URL and it gives you an RSS feed of commits, listing the checkin message and linking to the files it affected. It is of course just as excellent a way of keeping track of projects other people work on, perhaps not even including yourself; here is one tracking the progress of Exhibit, by David Huynh of the MIT Simile project. Which brings me to the subject of...

Exhibit. This is a beautiful client-side tool for visualizing, resorting and filtering information in all sorts of ways. I'm fairly confident this is where browser technology is heading, and I'm all for helping us get there a little quicker. So is David. Applied Semantic Web research does not come much more applied than this, and you can do beautiful things with this baby. I'm tinkering with a few blue-sky ideas of my own, and have been playing rather heavily with the Google Spreadsheets integration it grew just this weekend.

If you care to skim through that page, have or create some Google Spreadsheets (they import comma-separated values, for instance) and tag up the column titles as it expects, you can head over to my instant exhibit page and try it out right away. It will lack many of the features view customization gives you, but it is still a great place to start. Due to some slight lunacy in the JSONP exported by Google Spreadsheets, the top left cell (A:1) isn't present in the feed so far, so Exhibit decides it reads {label} -- in other words, keep your label property in the first column and settle for having it be a string, for the time being.

Google Pages: Finally some decent, large-capacity (at least 100 megabytes, anyway) hosting, without a lot of fuss. Even if you, like me, don't care much for their WYSIWYG page builders and templates, there is that neat feature on the right of the page where you can upload your own files and have them show up on the site right away, whether graphics, javascript, HTML or otherwise. Recommended. As far as I know, you can't subdivide your data into directories yet, but if you just need somewhere to dump things, you might not need to.

AdSense Black List: For AdSense web masters, this is a cooperative effort to use the AdSense competitive ad filter -- not to avoid linking to competitors, but to cut off the minimum-pay slack at the low end of the bell curve of earnings per click. A rather good idea for improving your outbound links. And for us visitors? Fewer links to made-for-AdSense sites of 100% junk.