2006-07-10

Encoding / decoding UTF8 in javascript

From time to time it has somewhat annoyed me that UTF8 (today's most common Unicode transport encoding, recommended by the IETF) conversion is not readily available in browser javascript. It really is, though, I realized today:
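
// String of Unicode characters -> string with one character per UTF-8 octet
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

// String with one character per UTF-8 octet -> string of Unicode characters
function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}
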
2012 Update: Monsur Hossain took a moment to explain how and why this works. It's a good, in-depth post citing all standards in play so you need not bring a wizard's beard to know why it works. Executive summary: escape and unescape operate solely on octets, encoding/decoding %XXs only, while encodeURIComponent and decodeURIComponent encode/decode to/from UTF-8, and in addition encode/decode %XXs. The hack above combines both tools to cancel out all but the UTF-8 encoding/decoding parts, which happen inside the heavily optimized browser native code, instead of you pulling the weight in javascript.
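
To make the cancelling-out concrete, here is a small sketch that steps through the intermediate values for a single character. The variable names are mine, for illustration only; the hack itself consists of nothing but the two one-liners.

```javascript
// Step through encode_utf8("ä") and back, one native call at a time.
var original = "\u00E4";                            // "ä", one character

// encodeURIComponent: UTF-8 encode, then percent-escape each octet.
var percentEncoded = encodeURIComponent(original);  // "%C3%A4"

// unescape: turn every %XX back into one character with that code.
var octetString = unescape(percentEncoded);         // "\xC3\xA4", two characters

// Reverse direction: escape percent-escapes the two octets again...
var reEscaped = escape(octetString);                // "%C3%A4"

// ...and decodeURIComponent reads the %XX pairs as UTF-8.
var roundTripped = decodeURIComponent(reEscaped);   // "ä"

console.log(percentEncoded, octetString.length, reEscaped, roundTripped === original);
```

The %XX layer is added by one function and removed by its counterpart, so only the UTF-8 step survives in each direction.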

Tested and working like a charm in these browsers:
Win32
  • Firefox 1.5.0.6
  • Firefox 1.5.0.4
  • Internet Explorer 6.0.2900.2180
  • Opera 9.0.8502
MacOS
  • Camino 2006061318 (1.0.2)
  • Firefox 1.5.0.4
  • Safari 2.0.4 (419.3)
Any modern standards-compliant browser should handle this code, though, so don't worry that it's a rather sparse test matrix. But feel free to use my test case: encoding and decoding the word "räksmörgås". That's Swedish for a shrimp sandwich, by the way -- very good subject matter indeed.
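
If you want to run that test case yourself, something along these lines should do. It is a sketch rather than the exact test I ran; the toHex helper is a hypothetical name added here for readable output, and the word is assumed to be in precomposed form (one code point per accented letter).

```javascript
// The two helpers from the post, repeated so this snippet runs on its own.
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

// Hypothetical helper (not part of the post): list the octet values as hex.
function toHex(octets) {
  var out = [];
  for (var i = 0; i < octets.length; i++) {
    var h = octets.charCodeAt(i).toString(16).toUpperCase();
    out.push(h.length < 2 ? "0" + h : h);
  }
  return out.join(" ");
}

var word = "r\u00E4ksm\u00F6rg\u00E5s"; // "räksmörgås", precomposed letters
var encoded = encode_utf8(word);

// 10 characters in, 13 octets out: ä, ö and å take two octets each.
console.log(word.length, encoded.length);

// 72 C3 A4 6B 73 6D C3 B6 72 67 C3 A5 73
console.log(toHex(encoded));

// Decoding should give back exactly the original word.
console.log(decode_utf8(encoded) === word);
```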

And if you hand me your platform/browser combination and its success/failure status for the tests, I'll try to update the post accordingly.

28 comments:

  1. How about the case when the original text is encoded using ISO-8859-1?

    Thanks!

    Alex

  2. You are in luck! Transforming text in ISO 8859-1 to Unicode is the identity transform (as in no change at all), as the code points they share have the same meaning in both encodings. For all other encodings (save US-ASCII, itself a subset of ISO 8859-1), you need to resort to laborious replace() hacks.

  3. Best solution, thanks.

  4. Worked like a charm with:
    Linux Firefox/2.0.0.2 (Ubuntu-edgy)

  5. Thanks Johan.
    However, on my machine, this does not quite work: the word got encoded as
    rÃ¤ksmÃ¶rgÃ¥s

    Win32, Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8

    Any idea why?

  6. What did you expect? That is how the UTF8 encoded text is represented when the undecoded UTF8 message is seen as a normal eight-bit string, when your character set is ISO-8859-1, commonly referred to as Latin-1.

  7. Hi Johan

    Thanks for the great little UTF-8 hack :) I've used it in my open source tool Hackvertor:
    http://www.businessinfo.co.uk/labs/hackvertor/hackvertor.php

    btw this isn't comment spam, Johan gave me permission to plug my tool :)

  8. Just one question... does it matter where you put that code? I'm completely new to javascript, so I don't have any idea.

  9. Put it between a <script> and </script> tag and you'll be fine.

  10. Excellent stuff. I wanted to note that:
    decode_utf8()
    can throw an error. It is probably a good idea to wrap the call in a try...catch

    I think I was getting this error when trying to decode UTF-8 when it was really ISO 8859.

  11. Same principles as in all programming apply: don't decode UTF-8 that isn't; don't divide by zero, and so on. Just adding a try/catch will hide errors in input and is inadvisable.

  12. Hi, I believe that this method doesn't work for characters like the EURO symbol €. In Firefox I get a "Malformed URI sequence" error.

  13. In that case you are doing it wrong; unescape(encodeURIComponent("€")) === "\xE2\x82\xAC" and decodeURIComponent(escape("\xE2\x82\xAC")) === "€" both return true, as they are supposed to.

  14. internet explorer 8:
    Win32, Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

  15. Result of your platform/browser:
    Win32, Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8 (.NET CLR 3.5.30729)

    My $_SERVER['HTTP_USER_AGENT']:
    Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

    Remarks:
    I am running Internet Explorer version 8.0, which is operating in some sort of compatibility mode, and suddenly the executed javascript makes its ajax requests with xmlHttp=new XMLHttpRequest(), whereas before (IE 6 and 7) it would execute xmlHttp=new ActiveXObject("Msxml2.XMLHTTP") or xmlHttp=new ActiveXObject("Microsoft.XMLHTTP").

    take care!

  16. How about uppercase letters?

    For me your example rÃ¤ksmÃ¶rgÃ¥s is correctly decoded into räksmörgås but RÃ„KSMÃ–RGÃ…S gives a "malformed URI sequence" error.

  17. That is because the UTF-8 encoding of RÄKSMÖRGÅS is "R\xC3\x84KSM\xC3\x96RG\xC3\x85S", not RÃ„KSMÃ–RGÃ…S. (Any incorrectly encoded input sequence will probably give you that error or a similar one.)

  18. Hmm, well... If RÄKSMÖRGÅS is encoded to utf8 with Javascript, it gets "R\xC3\x84KSM\xC3\x96RG\xC3\x85S". But if the same word is encoded to utf8 with Java or UltraEdit Text Editor (ASCII to UTF-8), it gets RÃ„KSMÃ–RGÃ…S.

    I am parsing an XML-document encoded as utf8 (the Java-version...) in Javascript.

    So, are there two versions of utf8?

  19. Well, somewhere between your UTF-8 encoders and this blog, something is going wrong at least, because „ and – and … (code points 8222, 8211 and 8230 respectively, all way beyond 255) are not 8-bit characters, which every octet in a valid UTF-8 string must be, by definition.

    Maybe you are sitting on some Windows system doing fancy quotes under your feet, or similar muck. Best of luck with your debugging.

  20. Best solution I've seen so far. The others usually are bitwise operations to get the UTF-8 byte sequences, but don't fully implement the encoding.

  21. Very good! Thank you very much. I've been having problems with a facebook connect site where the facebook stream.publish api method was not recognizing the accented characters being sent through a javascript function. I couldn't find any suggestions anywhere until I came across this blog. I applied your suggestion and voilà! problem resolved!

  22. For completeness, it would be nice if you extended the article with a bit of code highlighting that with this, you can create safe urls as well, by percent encoding all the utf8 bits:

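    function percent_encode( s )
    {
      var utf = encode_utf8( s ); // one character per UTF-8 octet
      var enc = '';
      for( var i = 0; i < utf.length; i++ )
      {
        var hex = utf.charCodeAt( i ).toString( 16 );
        // pad to two digits so octets below 0x10 still give a valid %XX
        enc += '%' + ( hex.length < 2 ? '0' + hex : hex );
      }
      return enc;
    }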

  23. Even now in 2010, this is the only usable search result on solving this specific problem. Has nobody else noticed and written about this?

    Also an UPDATE: Works in Chrome --
    Win32, Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.1.249.1045 Safari/532.5

  24. Thanks!

    We had been sending UTF-8 data through jQuery which was dealing with it fine in all browsers, until we switched to running our app in a Webkit component embedded in a PyQt application. Still sending UTF-8 from Python but it goes through an implicit "eval()" call, and I was ending up with Â£ (capital A, circumflex accent, pound sterling symbol) instead of £ (pound sterling symbol).

    Popping the UTF-8 strings through the above has fixed this.

    This is the QtWebkit 4.6.2.0

    Win32, Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB) AppleWebKit/532.4 (KHTML, like Gecko) Qt/4.6.2 Safari/532.4

    Thanks again

  25. Thanks for your note on decodeURIComponent. I'm dealing with an XSLT template that includes something like the following text

    <a href="javascript:alert('Désolé')">XXX</a>

    Now, XSLT is required to produce UTF-8 for Désolé because it's in an anchor, so the alert function then gets an unescaped version of this. My final solution after reading your note: call the following function (LOL) instead:

    function alertDecodeURI(text) {
    alert(decodeURIComponent(escape(text))); // escape to get back URI encoding
    }


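Several of the questions in the comments come down to the same two behaviours: UTF-8 octets shown as ISO 8859-1 look like "rÃ¤ksmÃ¶rgÃ¥s", and decode_utf8 throws when its input is not valid UTF-8. Here is a minimal sketch of both, assuming the two helpers from the post. The decodeThrows helper is a hypothetical name used only to report the outcome; it is not a suggestion to swallow errors in real code.

```javascript
// The two helpers from the post, repeated so this snippet runs on its own.
function encode_utf8(s) { return unescape(encodeURIComponent(s)); }
function decode_utf8(s) { return decodeURIComponent(escape(s)); }

// Hypothetical helper (not from the post): report whether decoding throws.
function decodeThrows(s) {
  try {
    decode_utf8(s);
    return false;
  } catch (e) {
    return e instanceof URIError;
  }
}

// 1. encode_utf8 returns one character per octet, so printing the result
//    directly gives the ISO 8859-1 view of the UTF-8 octets.
var latin1View = encode_utf8("r\u00E4ksm\u00F6rg\u00E5s"); // "räksmörgås"
console.log(latin1View); // rÃ¤ksmÃ¶rgÃ¥s

// 2. The UTF-8 octets of "RÄKSMÖRGÅS" decode fine, even though
//    0x84, 0x96 and 0x85 have no printable form in ISO 8859-1.
var upperOctets = "R\xC3\x84KSM\xC3\x96RG\xC3\x85S";
console.log(decode_utf8(upperOctets)); // RÄKSMÖRGÅS

// 3. The same octets after a Windows-1252 style conversion: 0x84, 0x96 and
//    0x85 have become U+201E, U+2013 and U+2026, which are no longer octets.
var mangled = "R\xC3\u201EKSM\xC3\u2013RG\xC3\u2026S";
console.log(decodeThrows(mangled)); // true

// 4. Text that was ISO 8859-1 all along is not valid UTF-8 either.
console.log(decodeThrows("r\xE4ksm\xF6rg\xE5s")); // true
```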