2006-08-20

The AdBlock random serial killer mystery

For the past year or more, I have been encountering strange site breakage that I always, more or less subconsciously, incorrectly attributed to the site owners. Random broken pictures here and there, typically in galleries, albums and the like. Rarely, but frequently enough to give the slightly "tainted" feeling of browsing a site kept almost but not perfectly in trim, or of some lapse of quality assurance in the site's file upload dialog allowing partial uploads, faulty data format conversions or similar. Perhaps one in every few hundred images broken, and only on IIS sites like those mentioned in my last post. Knowing a bit too much about the web, you can often make up plenty of plausible explanations for the kind of breakage you encounter once in a while.

But a few days ago, I encountered a piece of breakage that just wouldn't be explained away like that, on a community site where native site functionality had been turned off on one profile. You couldn't use the messaging or commenting features there, because they were shut down; dropped from the available options. Clicking links leading to the profile would mysteriously blow away the entire frame in a way I couldn't even begin to understand. It was all most unfathomable, and I couldn't help suspecting my own client side hackery: was one of my user scripts behind all of this?

Checking the DOM of these pages, the missing elements were indeed still there; they had their nodes, but were shut out of view via a CSS display:none; declaration. Surely I hadn't done anything like that in any of my scripts? Well, apart from that odd hack where I put an onerror handler on a few images injected by myself, dropping an image from display if its URL had gone 404 missing. No, that just wouldn't explain it -- and furthermore, the problem wouldn't go away with the first Greasemonkey debugging tip to try whenever you suspect something like this: clicking the monkey to turn off all Greasemonkey functionality temporarily and reloading the page. Yep, still the same mysterious disappearances. So the monkey went back on again.

By a stroke of luck, I finally stumbled on the culprit: AdBlock, and more precisely, a very trigger-happy regular expression rule fetched by the Filterset.G updater for trashing ads -- one that, by the sheer length of it, looks like it would only trigger on very specific match criteria indeed:

/[^a-z\d=+%@](?!\d{5,})(\w*\d+x\d)?\d*(show)?(\w{3,}%20|alligator|avs|barter|blog|box|central|context|crystal|d?html|exchange|external|forum|front|fuse|gen|get|house|hover|http|i?frame|inline|instant|live|main|mspace|net|partner|php|popin|primary|provider|realtext|redir\W.*\W|rotated?|secure|side|smart|sponsor|story|text|view|web)?_?ads?(v?(bot|brite|broker|bureau|butler|cent(er|ric)|click|client|content|coun(cil|t(er)?)|creative|cycle|data(id)?|engage|entry|er(tis\w+|t(pro)?|ve?r?)|farm|feelgood|force|form|frame(generator)?|gen|gif|groupid|head|ima?ge?|index|info|js|juggler|layer|legend|link|log|man(ager)?|max|mentor(serve)?|meta\.com|mosaic|net|optimi[sz]er|parser|peeps|pic|po(ol|pup|sition)|proof|q\.nextag|re(dire?c?t?|mote|volver)|rom\.net|rotator|sale|script|search|sdk|sfac|size|so(lution|nar|urce)|stream|space|srv|stat.*\.asp|sys|(tag)?track|trix|type|view|vt|x\.nu|zone))?s?\d*(status)?\d*(?!\.org)[\W_](?!\w+\.(ac\.|edu)|astra|aware|adurl=|block|login|nl/|sears/|.*(&sbc|\.(wmv|rm)))/

A bit of a mouthful, yes. What does it do? Well, in summary, it'll snag anything matching the substring "ad", optionally surrounded by one of a busload of words often found denoting ads across the web(*).

For instance, the strings "-AD" or "{AD" will do. That's incidentally a very common thing to find in GUIDs, which IIS servers like to sprinkle all over their URL space. There are five spots in those 38-byte identifiers (or sometimes four in 36, when the braces are gone) that each have a one in 256 chance of triggering a hit, assuming a perfectly uniform randomness distribution. Some URLs even have two or more GUIDs in them, increasing the odds further. I'm a bit surprised it doesn't strike more viciously than it does, but that's probably because the distribution isn't anywhere near uniform.

That regexp ate my friends! Down, boy! Baad regexp! Nooo good regexp cookie for you!

http://f.helgon.net/g/{451/{45167369-C843-4D00-AD38-AB2184AF0008}.jpg
http://photo.lunarstorm.se/large/9C4/{9C495852-AD84-4FE9-BB5F-7DAAD05CFB77}.jpg
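
To see the failure mode in isolation, here is a much-simplified stand-in for the core of the rule that I made up for demonstration (the real filter is the monster quoted above), run against the two victim URLs:

```javascript
// Deliberately simplified stand-in for the core of the Filterset.G rule
// (NOT the real filter): "ad", preceded by a non-alphanumeric character,
// optionally followed by digits, then a non-word character or underscore.
var simplified = /[^a-z\d=+%@]ads?\d*[\W_]/i;

var victims = [
  "http://f.helgon.net/g/{451/{45167369-C843-4D00-AD38-AB2184AF0008}.jpg",
  "http://photo.lunarstorm.se/large/9C4/{9C495852-AD84-4FE9-BB5F-7DAAD05CFB77}.jpg"
];

victims.every(function (url) { return simplified.test(url); }); // true:
// the "-AD38-" and "-AD84-" runs inside the GUIDs are enough to match

simplified.test("http://example.com/photos/cat.jpg"); // false: no "ad" at all
```

Two hex digits out of a GUID, flanked by dashes, and the image is gone.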


The issue has apparently already been reported to the Filterset.G people a year ago, though it was deemed unfixable at the time. I submitted a suggested solution, hoping for it to get fixed upstream: tucking a dash and an opening curly brace into the leading character class (the dash leading or trailing, just as long as it doesn't end up forming a character range) fixes this particular GUID symptom.
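
Illustrated on a made-up miniature of the rule (again, not the real filter), the fix amounts to adding those two characters to the leading exclusion class:

```javascript
// Made-up miniature of the rule, before and after the suggested fix:
// also excluding "{" and "-" (dash last, so it stays a literal) from the
// characters allowed to precede "ad" makes GUID segments stop matching.
var before = /[^a-z\d=+%@]ads?\d*[\W_]/i;
var after  = /[^a-z\d=+%@{-]ads?\d*[\W_]/i;

var guidUrl =
  "http://photo.lunarstorm.se/large/9C4/{9C495852-AD84-4FE9-BB5F-7DAAD05CFB77}.jpg";

before.test(guidUrl);            // true: "-AD84-" triggers it
after.test(guidUrl);             // false: "{" and "-" may no longer precede "ad"
after.test("banner_ad.gif");     // true: real ad URLs still match
```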

At large, though, this looks like a regexp that evolved, for instance, from two strict regexps being relaxed and "optimised" into one, or from one with a lot of internal repetition being made less repetitive and overly lax in the same blow.

By the looks of this huge regexp, it wants to find and match whatever matches either "something-ad" or "ad-somethingelse" -- a string matching "ad" together with any of a set of prefixes and/or suffixes known to signify an ad. In regexp land, you formulate this as "(prefix-)ad|ad(-suffix)". This will not give false positives for random words not listed among the given pre/suffixes, such as "add" or the good guys like "AdAware", or random GUIDs, whereas it might be set to trigger for "livead", "mainad", "adbroker" or "adcontent", for instance, as listed above.

But what this regexp does instead is match "(prefix-)?ad(-suffix)?", meaning: match "(prefix-)ad(-suffix)", "(prefix-)ad", "ad(-suffix)" or simply "ad", on its own!
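
The difference is easy to demonstrate with a toy pair of rules (my own miniature word lists, not Filterset.G's):

```javascript
// Toy miniature with made-up word lists. In the lax form, both prefix and
// suffix are optional, so a bare "ad" inside any word is enough to match:
var lax    = /(?:live|main)?_?ads?(?:broker|content)?/i;
var strict = /(?:live|main)_?ads?|ads?_?(?:broker|content)/i;

lax.test("squad");        // true: false positive, matches the bare "ad"
strict.test("squad");     // false: "ad" alone no longer qualifies
strict.test("mainad");    // true: known prefix present
strict.test("adcontent"); // true: known suffix present
```

The word lists only pull their weight in the strict form, where at least one of them has to be present for a match.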

Ouch! Given that we're just interested in whether we hit a culprit or not, that whole slew of carefully recorded words meant to strengthen the regexp against false positives suddenly doesn't! We might just as well throw them away. They don't do any harm, of course, except wasting a bit of memory and computrons every time the regexp is used, but unfortunately they don't do anything good either.

And worse: by being there at all, they imply being useful for something. Particularly when working with people who are very skilled artisans at what they do, we are mostly tempted to assume code is there for a purpose, and one that meets the eye, unless a nearby comment explicitly describes deeper magics hidden below the surface. This wastes the mind resources of anyone trying to improve on it, figuring "I'll just add this additional word that is a known ad, to improve the matcher incrementally". Except here, it won't achieve squat!

This is a very frequent cause of debugging-induced madness: trying to improve code that has secretly been broken for who knows how long. It's convention to write code that works, so except when in deep bug hunt mode, we read the surface of the code rather than adopting that heavyweight mindset which reads the entire code structure, filtering all possible input through it to see what happens. Especially with long regexps like this.

This is something to take heed of, and a problem that grows ever more likely to bite you as expression, nesting and complication levels rise in a chunk of code, whichever language you use. Increased complexity comes at the price of a matching decrease in readability, and before you know it, very intricate problems, hard to find and solve, creep all over.

While I've given a rough suggestion of the real problem with the above regexp, I can't do any good job of going over how it should look instead, as the complete regexp lists seventeen consecutive conditions to be met, eleven of which are on/off/zero-to-many toggles, and I don't know which combinations of those the regexp was aiming to meet.

Most likely, though, picking any random one of them that satisfies the above recipe will immensely strengthen the regexp against false positives, just by giving the already provided word lists meaning again. With luck (and I'd be surprised or sad if this is not so), the Filterset.G maintainers have their regexps version controlled, with good checkin comments noting what prompted every change, so they can track back to the commit where the two culprit question marks (the ones making the prefix and suffix word groups optional) got added, assuming they were not there from the very beginning, and see which parts were meant to go where in the match. And if they were there from the very beginning, all that remains is to mend the situation from here, as best you can guess.

I believe this story has many lessons to teach about software engineering, and not only the magics of regexpcrafting. Plus, I finally found and slayed the random serial killer that would wreak havoc in the IIS family photo albums! I'm sure we will all sleep better at night now for it. Or maybe not.

(*) Technically, the regexp is a bit further constrained: the match must not be directly followed by ".org" or a few extensions and other known non-ads, and it must be followed by another underscore or word character and preceded by a character that is neither alphanumeric nor any of =, +, % or @. But by and large, the words that make up the lion's share of the regexp are no-ops, unless you're using it to match and parse out those chunks for some further string processing, rather than looking for a boolean "matches!" or "does not match!" result, as is the case here.

2006-08-11

IIS your server well tuned?

For fun, and to scratch some web site usability itches, I have been playing with two IIS backed web sites, and while I believe my readership doesn't overlap much with people who run their web sites off IIS, I hope this could be useful to someone who ends up here by way of a search engine. It's two tips hoping to improve the world in the very slightest sense, offering best practice thoughts that easily bypass IIS site developers.

Drop the needless X-Powered-By: ASP.NET header!

With default settings, an IIS server running ASP.NET will happily announce this with every response for a less-than-caring web browser, not only with the standard HTTP Server header, but also via a special X-Powered-By: ASP.NET header, incurring a needless performance hit for every visitor of your site on every request, to the sole purpose of feeding the ego of some Microsoft employee, committee or other source of duhcision making. HTTP is a talkative protocol as it is; you don't need to feed it additional payload for the sake of proving a point. Turn it off.
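
In the IIS Manager, X-Powered-By sits under the web site's HTTP Headers tab as an ordinary custom header you can select and remove. For reference, in the newer IIS 7 configuration model the equivalent would be a web.config fragment along these lines (an assumption on my part; verify against your server version's documentation):

```xml
<configuration>
  <system.webServer>
    <httpProtocol>
      <customHeaders>
        <!-- X-Powered-By is just a custom header; drop it -->
        <remove name="X-Powered-By" />
      </customHeaders>
    </httpProtocol>
  </system.webServer>
</configuration>
```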

URLs are case sensitive!

I know, you don't see this much as an IIS developer, because in Microsoft land, URLs are case insensitive, and to IIS, they are too, probably trading worse performance of customer web sites for a lower toll on Microsoft support lines, since paths on Windows file systems have historically been case insensitive and users were expecting the same of URLs.

Anyway, the bad thing with treating URLs as if they were case insensitive is that you see no visual indication of anything being wrong when you mix and match capitalization of the URLs on your site as you extend it, linking to the "read post" page as /blog/readpost.aspx in one view, /Blog/ReadPost.aspx in another, and most likely also to /blog/ReadPost.aspx and /Blog/readpost.aspx through relative links from the same directory, without the leading path segment. (As a visitor follows a URL to /blog/readpost.aspx, and that page links adjacent posts via the ReadPost.aspx capitalization, on following one you end up at /blog/ReadPost.aspx.)

Multiply by the number of possible capitalizations offered in the query parameters, i.e. a random number of variants on UserID, PostID and date, for instance, and you have a huge number of different URLs that all point to the same resource (a specific post, reached through a combinatorially explosive number of same-name, different-case aliases).

IIS will not care about the case discrepancy, BUT HTTP DOES.

Why? For two reasons: caching, and browser history. HTTP and your browser agree that URLs are case sensitive. When you load /blog/readpost.aspx and /Blog/ReadPost.aspx, they are different URLs, and there is no assumption of them being the same document, so there will be no attempt at pulling the second page from your browser's cache entry registered with the first page. Had the same casing been used, the second page load would see a short 304 Not Modified response (possibly made not-as-short by the above extraneous header), without its page body payload, instantly serving the page from the browser's cache. Great! Less load on your server!

Furthermore, there is browser history. Good web sites employ style sheets (or simply do not serve any styling information at all, leaving that to browser defaults) that differentiate visually between visited and non-visited links, typically by way of different shades of the link color. Great for usability: it's easier for your visitors to overview what content they have already read through and concentrate on the rest in their further exploration of your site. No need to read yesterday's blog posts again just because you didn't recognize the title, only to recall it on seeing the contents, having wasted your time clicking the link to see it again.

This visualization also can't tell that different URLs are really the same resource, so the same blog post, linked from a view predisposed towards CamelCase, will not be recognized as the one you visited yesterday via an all-lowercase link, and the visual cue won't be there. So the view you came from yesterday will say you have read the post, but the view you see today won't. Browse a few posts from today's view and reload the view you came through yesterday, and you will only have one post marked read there. It's all very confusing to your users, but IIS will make sure the pages load, and you won't see that your server draws combinatorially more bandwidth than it would have had to, given the number of different ways of ending up at the same post. To save face, your web designers might just as well end up hiding the visual styling of visited and unvisited links, as it just seems to break and confuse rather than help, for some weird reason nobody can quite understand.

But now, at least, you will!

I have been playing with using just user CSS to improve web sites, but this fallacy of randomly capitalized links broke my "visited" markers (a ✓ prefixed to visited links, and non-breaking space characters filling the same column for unvisited links, makes for a great instant overview of which posts are read and not in vertical lists). Users of the Stylish extension might want to have a peek at the source code of my checkmark read posts user CSS, but to be really good with sites like these, it probably takes a full user script to case normalize all links on the site first, which I haven't quite gotten around to doing yet. I expect downcasing site internal paths and query parameter names might make a perfect solution.
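
Such a user script could start out from something like this rough sketch of the normalization itself (my own made-up code, not a published script; it assumes the site serves the same resource regardless of path case, as IIS does):

```javascript
// Sketch: lowercase the path, and lowercase query parameter names while
// preserving their values, so the browser's cache and history treat
// equal resources as equal URLs.
function normalizeHref(path, query) {
  return path.toLowerCase() +
         query.replace(/[?&]([^=&]+)/g, function (m) { return m.toLowerCase(); });
}

// In a Greasemonkey user script, one would apply it to every same-host
// link, along the lines of:
//   var a = document.links[i];
//   if (a.hostname === location.hostname) {
//     a.pathname = a.pathname.toLowerCase();
//     a.search   = a.search.replace(/[?&][^=&]+/g, function (m) {
//       return m.toLowerCase();
//     });
//   }

normalizeHref("/Blog/ReadPost.aspx", "?PostID=42&UserID=7");
// "/blog/readpost.aspx?postid=42&userid=7"
```

Note that parameter values are left alone; only the names are folded, since values may well be case significant.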

I have started seeing these "checkmark for read pages" appear on blogs recently, which is nice. So far I have only seen it as suffixes to links, which isn't as easy to overview in a flat list (as the right edge is ragged) as the variant with an additional column for the visited / non-visited marking, but it's a good start. And if you don't want to meddle with unicode, you can of course employ a different background for the visited links, perhaps with a nicer graphic of some kind. I suggest using some amount of left padding to give the image some space, rather than cluttering up the link text with the "background" imagery.

2006-08-02

Date/time input usability

If you have ever designed web input controls for picking a(ny) date and time, you might have been tempted to pick Blogger's approach to it, setting up a sizable array of select boxes, like this:

[six select boxes: month, day, year and hour ":" minute, plus AM/PM]
It works. And, as anyone who has ever used them to pick a date and time different from the prepopulated one knows, it is painful to use, even if you don't resort to the click mayhem of doing it keyboard-unaided. That's six times two clicks, plus two or three click+drags to scroll among the options of the larger select boxes, plus all the precision mouse maneuvering involved with hitting the very small target zones for each and every one of those clicks. Try it with a laptop touch pad if you find it too easy with a mouse and your many years' worth of computing experience. It's just not fun any more. Add the joys of converting 24 hour time to AM/PM, if your brain isn't wired to twelve hour time, for another optional tinge of discomfort.

What I'm trying to say is that it is not a very friendly approach to achieving the wanted functionality, though it admittedly makes it difficult (but not impossible!) to enter illegal date/time combinations. So, how would one improve upon the situation without sacrificing that criterion? After all, replacing these select boxes with a pre-populated text input reading a default date, say in the format "YYYY-MM-DD hh:mm" (annotated as such), might do away with the click mayhem involved in picking a date and time combination of your own, but it instead becomes easy to make a mistake that requires another iteration of editing, in case the date somehow turns out not quite up to spec. Especially as that format is quite likely not how the user would intuitively phrase a timestamp in writing. Another server roundtrip (for a typical non-ajax implementation) for an "uhm, try again with the date, please?" does not equal usage bliss either.

But the text entry is a great place to start, and for some use cases much better than a popup calendar date picker would be, for instance.

Aside from the specifics of this (short of perfect) visual packaging, try out this Blogger free-form date field user script (direct install link) and head over to your Blogger post editor, where you will now encounter a variant which keeps the fields to show which date has been picked, and lets you type dates by hand in a text field next to the dropdowns, updating the rest of the fields with the data it successfully understood.

While mostly crafted for my needs of half a year ago, when I was importing a lot of blog posts from an external source that listed dates formatted in a textual style I wanted to just paste and be done with, I took some time to make it a bit more useful, so it understands lots of partial and complete date formats, in English, French and Swedish (I don't do too well with other languages). It is rather easy to extend to other languages, as long as there is no overlap among them in the names or short forms of months and the relative dates "today" and "yesterday".

Note how it clues you in to which parts of the date "took", by tinting the appropriate fields greenish. As the Blogger date fields only suggest a few recent years in the year dropdown, I spiced it up to liberally add whichever year you type in yourself too, after a slight delay, in case it was a typo just as quickly fixed as it was typed. I don't recall the exact details of why I found that safeguard useful any more, but left it in place, in case there was some good thinking behind it. There often is, but I don't seem to have left a good enough documentation trail. Anyway, here is a sample of date and time formats it can handle, to get you started:
1972-04-23
18:05
12/25/2005 (read commentary below if you want D/M/Y dates!)
2006-08-02 14:56
6 Aug, 1996, 04:28
samedi 28 février 2004 19:55 (irrelevant junk around ignored)
Yesterday, 12:31
aujourd'hui
igår
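
To give a flavor of the lenient, pick-what-you-recognize approach, here is a minimal sketch of my own (far simpler than the actual script): each pattern claims the parts it recognizes, and anything not understood is simply left alone.

```javascript
// Minimal sketch of lenient date/time parsing. Real code would cover
// month names, D/M/Y versus M/D/Y ordering, and much more.
function parseLoose(s) {
  var out = {}, m;
  if ((m = /(\d{4})-(\d\d)-(\d\d)/.exec(s))) {   // ISO-style date
    out.year = +m[1]; out.month = +m[2]; out.day = +m[3];
  }
  if ((m = /(\d{1,2}):(\d\d)/.exec(s))) {        // 24 hour time
    out.hour = +m[1]; out.minute = +m[2];
  }
  if (/yesterday|igår|hier/i.test(s)) out.relative = -1;
  return out;
}

parseLoose("2006-08-02 14:56"); // { year: 2006, month: 8, day: 2, hour: 14, minute: 56 }
parseLoose("Yesterday, 12:31"); // { hour: 12, minute: 31, relative: -1 }
```

The script's green tinting then follows directly from which properties came back populated.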

Enjoy!

2006-08-01

Unicode code point bookmarklet

This bookmarklet is for all of you who want to look up the unicode code point for some character you have encountered, or, the other way around, when you have a unicode code point and want the corresponding character.

I fairly often (at least several times per year, sometimes per month) find myself in this situation, for some reason or other. Most of the time, it's that I never learned how to convince keyboards to deliver a particular character ("×", for instance), but took my time memorizing its code point, so I'd be able to just type it into my Emacs by typing Control-Q followed by its (octal) character code (327), regardless of whether I was at home, in Sweden, France, San Francisco, on a macintosh, unix or windows machine, or a VT100 terminal, or... ...yes, enough already; you get the point. And occasionally, it's some piece of random Unicode trivia that for some reason stuck, like that the roman numerals start at U+2160. (Beg your pardon if I poisoned your mind there.)

Anyway, browser bookmarks are about as good and trusty as Emacsen, even if you sometimes have to look them up when in an unfriendly environment (unless you're already using Google browser sync or similar friendly tools to bring your browser environment with you wherever you go). The above bookmarklet is crafted to integrate nicely with Firefox's bookmark keywords, so you can stash it away somewhere deep in the bookmarks hierarchy to get lost in peace. Assuming you gave the bookmark the keyword "char", type "char 64" into your address bar when, say, you forgot where they hid the @ sign in the Swedish macintosh locale, and thus produce the wanted character anyway via a trip over the clipboard. I love the clipboard. (Just clicking the bookmarklet, or invoking it without a parameter, will ask for the character or character code instead.)

Just typing a number verbatim treats it as decimal; prefix it with a zero to mark it as octal (0327 for the multiplication sign), or with 0x or U+ for hexadecimal, as in 0x216B for the roman numeral twelve, Ⅻ (yes, U+2160 isn't a zero but a one, Ⅰ, since the good Romans didn't feel much need for any zeroes).
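
The conversion logic itself fits in a few lines; something like this sketch (my own reconstruction of the idea, not the bookmarklet's actual code):

```javascript
// Sketch of the lookup, both directions: a numeric spec gives back the
// character; anything else is treated as a character to name by code point.
function lookup(input) {
  var m;
  if ((m = /^(?:0x|U\+)([0-9a-f]+)$/i.exec(input)))   // hexadecimal
    return String.fromCharCode(parseInt(m[1], 16));
  if (/^0[0-7]+$/.test(input))                        // octal
    return String.fromCharCode(parseInt(input, 8));
  if (/^\d+$/.test(input))                            // decimal
    return String.fromCharCode(parseInt(input, 10));
  return "U+" + input.charCodeAt(0).toString(16).toUpperCase();
}

lookup("64");     // "@"
lookup("0327");   // "×", the multiplication sign
lookup("0x216B"); // "Ⅻ", roman numeral twelve
lookup("×");      // "U+D7"
```

(String.fromCharCode covers the basic multilingual plane, which is plenty for this kind of trivia.)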

And, of course, pasting the one-off 愛, 猫, ♀ or ❤ useful character into it gives you the brain bugs that prove oh so useful when you're in a poor input locale with a mindful of numbers that hog your brain. I'm sure it happens all the time!