2016-12-15

Control Freak

After Google Chrome discontinued native user script support in the shipping default build, I ran Tampermonkey for a while for my user scripting – a pretty full-featured extension similar to Firefox's Greasemonkey.

And then at some point, I stumbled upon Control Freak, a minimalist solution for quick hacks and tweaks with no support for sharing those hacks, but with swift access to making a script for the current page, domain, or the whole web.

And despite being really fond of sharing the tools I build, I liked it, and have stuck with it since. But I will at least make one little hack available for it, to make it easy to do on.js-style scraping from the Chrome devtools by way of a Control Freak user script.

For now, I resort to digging up the link to the current .user.js version from https://git.io/on.js, pasting that into the "lib" field, and then writing myself a little scraper definition – here is one for https://tumblr.com/dashboard that just parses out basic metadata from the posts fetched in the initial pageload:

scrape(
[ 'css* .post_info_link'
, { node: 'xpath .'
  , data: 'xpath string(@data-tumblelog-popover)'
  , link: 'xpath string(@href)'
  , author: 'xpath string(.)'
  }
]);

From then on, each time I go to that page, my Chrome console will expose the on function on the page, the scraper definition above, and the array of scraped items (per the css* contextual selector: an array of 0 or more parsed .post_info_link elements from the page, with the four sub-bits dug out for each).

If I want to actually do something useful with them in the script, that's easy sleight of hand, too, but oftentimes, I just want programmatic access to parts of the page. I can be bothered to write up a scraper definition once, if it's easy to save it and reap the benefits of it on later visits, and this lets me do just that, and evolve it over time, with a minimum of fuss.
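
Just to illustrate what that buys (the variable name below is hypothetical – whatever name the script exposes the scraped array under), the items are plain data to poke at straight from the console:

// hypothetical name for the array of scraped items the script exposes:
scrapedPosts
  .filter(post => post.data)                 // only posts with popover metadata
  .map(post => post.author + ' – ' + post.link);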

If the syntax looks foreign to you (it probably does), it's a json-literal-coded unholy mix of regexp, css, and xpath syntax idioms, for scraping a templated web page into a javascript data structure whose format you devise yourself – a later evolution of ideas I posted to greasemonkey-dev in 2007 / 2012.

You can think of it as a more powerful version of document.querySelector, document.querySelectorAll, and document.evaluate, and lots of boring iteration code to create the objects/arrays you seek, rolled into one, with a unified/bridged syntax.
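
To make the comparison concrete, here is roughly the boring hand-rolled version of the tumblr scraper definition above (just a sketch – on.dom of course also covers the xpath side, the arities, and arbitrarily deep nesting):

// roughly what the scrape() definition above rolls into one expression:
const items = Array.from(document.querySelectorAll('.post_info_link'))
  .map(node => (
    { node: node
    , data: node.getAttribute('data-tumblelog-popover') || ''
    , link: node.getAttribute('href') || ''
    , author: node.textContent
    }));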

Some bare-bones examples:

on.dom('css? body') = document.querySelector('body') = document.body
on.dom('css* img') = document.querySelectorAll('img') ~ document.images (but as a proper Array)
// or, equivalent, but with xpath selectors:
on.dom('xpath? //body')
on.dom('xpath* //img')

The ? and * qualifiers borrow regexp semantics – ? either finds one node or produces null, and * always produces an Array of 0 or more elements, whether the expression matched or not. For cases where you don't want the expression to yield anything unless you had one or more matches, no qualifier and + similarly force a match for the selector:

on.dom('css h1') = document.querySelector('h1')
on.dom('css+ img') = document.querySelectorAll('img') ~ document.images (but as a proper Array)
// or, equivalent, but with xpath selectors:
on.dom('xpath //h1')
on.dom('xpath+ //img') = devtools' $x('//img')
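
Spelled out (the class name here is made up), the optional arities degrade gracefully when nothing in the page matches:

on.dom('css? .no-such-thing')  // null – the node was optional
on.dom('css* .no-such-thing')  // [] – always an Array, possibly empty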

To create an object with some named things in it, just declare the shape you want:

on.dom(
{ body: 'css body'
, images: 'css+ img'
});

Let's say you want all links in the page with an image in them:

on.dom('xpath+ //a[@href][count(.//img) = 1]')

But instead of the img elements, we want their src attribute, and the href attribute of the link. This is where context selectors come into play (an array literal with the context selector first, followed by the object structure to create for each item):

on.dom(
[ 'xpath+ //a[@href][count(.//img) = 1]'
, { href: 'xpath string(@href)'
  , src: 'xpath string(.//img/@src)'
  }
])

For a page with three such elements, this might yield you something like:

[ { href: '/', src: '/homepage.png'}
, { href: '/?page=1', src: '/prev.png'}
, { href: '/?page=3', src: '/next.png'}
]

The context selector lets us drill into the page, and decode properties relative to each node, with much simpler selectors local to each instance in the page. If we want the a and img elements in our data structure too, that's an easy addition:

on.dom(
[ 'xpath+ //a[@href][count(.//img) = 1]'
, { href: 'xpath string(@href)'
  , src: 'xpath string(.//img/@src)'
  , img: 'xpath .//img'
  , a: 'xpath .'
  }
])

Any leaf in the structure ('<selectorType><arity> <selector>') that you want to drill sub-properties out of can similarly be replaced by a ['contextSelector', {subPropertiesSpec}] pair, which makes decomposing deeply nested templated pages comfy.
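
For instance (a sketch following the same pattern – the page structure is made up, not something I have tested), pulling every link together with all of its images as a nested array:

on.dom(
[ 'xpath+ //a[@href]'
, { href: 'xpath string(@href)'
  , images:
    [ 'xpath* .//img'
    , { src: 'xpath string(@src)'
      , alt: 'xpath string(@alt)'
      }
    ]
  }
])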

Put another way: this is for web pages what regexps with named match groups are to strings of text, but with tree structured output, as web pages are tree structured instead of one-dimensional. And I think it makes user scripting a whole lot more fun.

2016-11-22

Unchoking

I have almost stopped writing about tech stuff in recent years, despite web APIs and javascript features catching up with sanity ever faster. What used to be a very horrible hack a few years back – fetching a web page and producing a DOM you could query from it – is at the moment both pretty readable and understandable:

const wget = async(url) => {
  try {
    const res = await fetch(url), html = await res.text();
    return (new DOMParser).parseFromString(html, 'text/html');
  } catch (e) {
    console.error(`Failed to parse ${url} as HTML`, e);
  }
};

wget('/').then(doc => alert(doc.title));

This already works in a current Google Chrome Canary (57). Sadly there is no javascript console support for doc = await wget('/'); you still have to use the "raw" promise API directly for interactive code, rather than the syntax-sugared blocking behaviour – but it's still a lot nicer than things used to be. And you can of course assign globals and echo when it's done:

wget('/').then(d => console.log(window.doc = d));
doc.title;

Querying a DOM with an XPath selector and an optional context node is still as ugly as it always was (somehow only the first made it into the Chrome js console):

const $x = (xpath, root) => {
  // accept an element, a document, or nothing as the context node:
  const doc = root ? root.evaluate ? root : root.ownerDocument : document;
  const got = doc.evaluate(xpath, root||doc, null, 0, null); // 0 = ANY_TYPE
  switch (got.resultType) {
    // expressions yielding primitives come back as just that:
    case got.STRING_TYPE:  return got.stringValue;
    case got.NUMBER_TYPE:  return got.numberValue;
    case got.BOOLEAN_TYPE: return got.booleanValue;
    default:
      // ...and node-sets come back as a proper Array:
      let res = [], next;
      while ((next = got.iterateNext())) {
        res.push(next);
      }
      return res;
  }
};

const $X = (xpath, root) => Array.prototype.concat($x(xpath, root))[0];
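
A few usage sketches (the selectors are of course just examples):

$x('string(//title)');   // the page title, as a string
$x('//a[@href]');        // an Array of all links with an href attribute
$X('//h1');              // the first h1 element, or undefined

// ...and combined with the wget helper above:
wget('/').then(doc => console.log($x('string(//title)', doc)));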

But for the corresponding css selector utilities ($$ and $), we can now just say document.querySelectorAll() and document.querySelector() respectively. Nice-to-haves, like the lexically bound arrow functions. I guess web page crafters overall rarely if ever use XPath, and that it is an XML vestige we should be really happy we have at all, through happy accidents of history, even though it needs a bit of shimming love to become lovable.
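
And if you want the same shorthands in your own scripts (the devtools console already provides $ and $$ itself), the css side is a two-liner nowadays – a minimal sketch:

const $  = (css, root) => (root || document).querySelector(css);
const $$ = (css, root) => Array.from((root || document).querySelectorAll(css));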