ecmanaut

The author

2006-07-11, 00:47

Encoding / decoding UTF8 in javascript

From time to time it has somewhat annoyed me that UTF8 (today's most common Unicode transport encoding, recommended by the IETF) conversion is not readily available in browser javascript. It really is, though, I realized today:
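// Returns a string with one character (code 0-255) per UTF-8 octet of s.
function encode_utf8( s )
{
  return unescape( encodeURIComponent( s ) );
}

// Inverse: takes such an octet string and returns the decoded text.
function decode_utf8( s )
{
  return decodeURIComponent( escape( s ) );
}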
Tested and working like a charm in these browsers:
Win32
  • Firefox 1.5.0.6
  • Firefox 1.5.0.4
  • Internet Explorer 6.0.2900.2180
  • Opera 9.0.8502
MacOS
  • Camino 2006061318 (1.0.2)
  • Firefox 1.5.0.4
  • Safari 2.0.4 (419.3)
Any modern standards-compliant browser should handle this code, though, so don't worry that it's a rather sparse test matrix. Feel free to use my test case: encoding and decoding the word "räksmörgås". That's Swedish for a shrimp sandwich, incidentally -- very good subject matter indeed.

And if you hand me your platform/browser combination and its success/failure status for the tests, I'll try to update the post accordingly.
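
The test case described above can be run as a small round trip. This is a minimal sketch: the two helpers are repeated from the post so the snippet runs on its own, and the variable names are only illustrative.

```javascript
// Same helpers as in the post, repeated so the snippet is self-contained.
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

var word = "räksmörgås";           // 10 characters
var encoded = encode_utf8(word);    // one character per UTF-8 octet
var decoded = decode_utf8(encoded); // back to the original text

console.log(encoded.length);        // 13: ä, ö and å take two octets each
console.log(decoded === word);      // true
```

A browser passes the test if the last line logs true.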

22 Comments:

  • Nifty!

    By Anonymous Thomas Frank, on Thu Aug 03, 01:29:00 AM CEST  

  • How about the case when the original text is encoded using ISO-8859-1?

    Thanks!

    Alex

    By Blogger Alex Iskold, on Tue Aug 15, 12:22:00 AM CEST  

  • You are in luck! Transforming text in ISO 8859-1 to Unicode is the identity transform (as in no change at all), as the code points they share have the same meaning in both encodings. For all other encodings (save US ASCII, in turn a subset of ISO 8859-1), you need to resort to laborious replace() hacks.

    By Blogger Johan Sundström, on Tue Aug 15, 12:55:00 AM CEST  
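
One way to read the "identity transform" remark: every ISO 8859-1 byte value is also the Unicode code point of the same character. A minimal sketch, assuming the text arrives as an array of byte values; the helper name latin1_bytes_to_string is hypothetical and not from the post.

```javascript
// Hypothetical helper: turns an array of ISO 8859-1 byte values into a string.
function latin1_bytes_to_string(bytes) {
  var s = "";
  for (var i = 0; i < bytes.length; i++) {
    s += String.fromCharCode(bytes[i]); // byte value equals the code point
  }
  return s;
}

// "räksmörgås" as ISO 8859-1 octets
var latin1 = [0x72, 0xE4, 0x6B, 0x73, 0x6D, 0xF6, 0x72, 0x67, 0xE5, 0x73];

console.log(latin1_bytes_to_string(latin1) === "räksmörgås"); // true
```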

  • Best solution, thanks.

    By Anonymous Andrej, on Thu Feb 22, 05:30:00 PM CET  

  • Worked like a charm with:
    Linux Firefox/2.0.0.2 (Ubuntu-edgy)

    By Anonymous Thomas Langvann, on Tue Mar 06, 01:48:00 PM CET  

  • Thanks Johan.
    However, on my machine, this does not quite work: the word got encoded as
    rÃ¤ksmÃ¶rgÃ¥s

    Win32, Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8

    Any idea why?

    By Blogger Riši, on Sat Oct 27, 12:51:00 PM CEST  

  • What did you expect? That is how the UTF8 encoded text is represented when the undecoded UTF8 message is seen as a normal eight-bit string, when your character set is ISO-8859-1, commonly referred to as Latin-1.

    By Blogger Johan Sundström, on Sat Oct 27, 03:34:00 PM CEST  
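
The exchange above can be checked by looking at the character codes of the encoded string. A minimal sketch: each character of the result holds one UTF-8 octet, and showing those octets as ISO 8859-1 characters gives exactly the text the commenter saw.

```javascript
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

var encoded = encode_utf8("räksmörgås");

// Collect the octet values in hex.
var codes = [];
for (var i = 0; i < encoded.length; i++) {
  codes.push(encoded.charCodeAt(i).toString(16));
}

console.log(codes.join(" "));
// 72 c3 a4 6b 73 6d c3 b6 72 67 c3 a5 73

// The same octets, read as ISO 8859-1 characters:
console.log(encoded === "rÃ¤ksmÃ¶rgÃ¥s"); // true
```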

  • Got it, thanks.

    By Blogger Riši, on Sat Oct 27, 04:30:00 PM CEST  

  • Hi Johan

    Thanks for the great little UTF-8 hack :) I've used it in my open source tool Hackvertor:
    http://www.businessinfo.co.uk/labs/hackvertor/hackvertor.php

    btw this isn't comment spam, Johan gave me permission to plug my tool :)

    By Anonymous Anonymous, on Tue Feb 26, 12:55:00 PM CET  

  • Just one question... does it matter where you put that code? I'm completely new to javascript, so i don't have any idea.

    By Anonymous Anonymous, on Sat Mar 15, 09:10:00 PM CET  

  • Put it between a <script> and </script> tag and you'll be fine.

    By Blogger Johan Sundström, on Sun Mar 16, 09:22:00 AM CET  

  • Excellent stuff. I wanted to note that:
    decode_utf8()
    can throw an error. It is probably a good idea to wrap the call in a try...catch

    I think I was getting this error when trying to decode UTF-8 when it was really ISO 8859.

    By Blogger scott, on Sun Sep 07, 09:47:00 PM CEST  

  • Same principles as in all programming apply: don't decode UTF-8 that isn't; don't divide by zero, and so on. Just adding a try/catch will hide errors in input and is inadvisable.

    By Blogger Johan Sundström, on Sun Sep 07, 10:35:00 PM CEST  
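
The error discussed above can be reproduced by feeding decode_utf8() ISO 8859-1 octets. A minimal sketch: the try/catch here exists only to display what was thrown, not as a recommendation to swallow the error.

```javascript
function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

// "räksmörgås" as ISO 8859-1 octets, which is not valid UTF-8
var latin1 = "r\xE4ksm\xF6rg\xE5s";

var error_name = null;
try {
  decode_utf8(latin1);
} catch (e) {
  error_name = e.name; // caught only so the error type can be shown
}

console.log(error_name); // URIError
```

The octet 0xE4 announces a three-octet sequence, but the following "k" is not a continuation octet, so the decoder rejects the input.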

  • Hi, I believe that this method doesn't work for characters like the EURO symbol €. In Firefox I get a "Malformed URI sequence" error.

    By Anonymous Technics, on Thu Mar 12, 03:20:00 PM CET  

  • In that case you are doing it wrong; unescape(encodeURIComponent("€")) === "\xE2\x82\xAC" and decodeURIComponent(escape("\xE2\x82\xAC")) === "€" both evaluate to true, as they are supposed to.

    By Blogger Johan Sundström, on Thu Mar 12, 09:11:00 PM CET  
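
The two expressions in the reply above can be run as they stand. The sketch below also shows one way to provoke the reported error: passing the un-encoded "€" to the decoding step. That this is what the commenter did is only a guess.

```javascript
var encoded = unescape(encodeURIComponent("€"));
console.log(encoded === "\xE2\x82\xAC");                        // true
console.log(decodeURIComponent(escape("\xE2\x82\xAC")) === "€"); // true

// Guess at the failure: decoding a string that was never UTF-8 encoded.
// escape("€") yields "%u20AC", which decodeURIComponent() rejects.
var error_name = null;
try {
  decodeURIComponent(escape("€"));
} catch (e) {
  error_name = e.name; // caught only so the error type can be shown
}
console.log(error_name); // URIError
```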

  • internet explorer 8:
    Win32, Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

    By Anonymous Anonymous, on Sat May 02, 09:16:00 AM CEST  

  • Result of your platform/browser:
    Win32, Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8 (.NET CLR 3.5.30729)

    My $_SERVER['HTTP_USER_AGENT']:
    Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

    Remarks:
    I am running internet explorer version 8.0, which is operating in some sort of compatibility mode, and suddenly the executed javascript makes ajax requests with xmlHttp=new XMLHttpRequest(), when before (in iex 6 and 7) it would execute xmlHttp=new ActiveXObject("Msxml2.XMLHTTP") or xmlHttp=new ActiveXObject("Microsoft.XMLHTTP")

    take care!

    By Anonymous Anonymous, on Sun May 03, 02:28:00 AM CEST  

  • How about uppercase letters?

    For me your example rÃ¤ksmÃ¶rgÃ¥s is correctly decoded into räksmörgås but RÃ„KSMÃ–RGÃ…S gives a "malformed URI sequence" error.

    By Anonymous Conny, on Thu Jun 04, 08:06:00 AM CEST  

  • That is because the UTF-8 encoding of RÄKSMÖRGÅS is R\xC3\x84KSM\xC3\x96RG\xC3\x85S, not RÃ„KSMÃ–RGÃ…S. (Any incorrectly encoded input sequence will probably give you that error or a similar one.)

    By Blogger Johan Sundström, on Thu Jun 04, 09:10:00 PM CEST  

  • Hmm, well... If RÄKSMÖRGÅS is encoded to utf8 with Javascript, it gets R\xC3\x84KSM\xC3\x96RG\xC3\x85S. But if the same word is encoded to utf8 with Java or UltraEdit Text Editor (ASCII to UTF-8), it gets RÃ„KSMÃ–RGÃ…S.

    I am parsing an XML-document encoded as utf8 (the Java-version...) in Javascript.

    So, are there two versions of utf8?

    By Anonymous Conny, on Fri Jun 05, 05:19:00 PM CEST  

  • Well, somewhere between your UTF-8 encoders and this blog, something is going wrong at least, because „ and – and … (code points 8222, 8211 and 8230 respectively, all way beyond 255) are not 8-bit characters, which every octet in a valid UTF-8 string must be, by definition.

    Maybe you are sitting on some Windows system doing fancy quotes under your feet, or similar muck. Best of luck with your debugging.

    By Blogger Johan Sundström, on Fri Jun 05, 09:07:00 PM CEST  
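
The three code points named above are what the octets 0x84, 0x96 and 0x85 become when read as Windows-1252 rather than as raw 8-bit values. A minimal sketch of that effect, assuming a Windows-1252 view is indeed what happened; the table cp1252 and the helper view_as_cp1252 are hypothetical and cover only the three octets relevant here.

```javascript
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

// Partial Windows-1252 table: only the three octets that matter for this word.
var cp1252 = { 0x84: "\u201E", 0x85: "\u2026", 0x96: "\u2013" };

// Hypothetical helper: shows an octet string the way a Windows-1252 viewer would.
function view_as_cp1252(octets) {
  var out = "";
  for (var i = 0; i < octets.length; i++) {
    var code = octets.charCodeAt(i);
    out += cp1252[code] || octets.charAt(i);
  }
  return out;
}

var encoded = encode_utf8("RÄKSMÖRGÅS");
var viewed = view_as_cp1252(encoded);

console.log(viewed);             // RÃ„KSMÃ–RGÃ…S
console.log(viewed === encoded); // false: no longer a string of octets
```

Passing viewed to decode_utf8() fails because „, – and … are beyond 255, whereas encoded itself decodes cleanly.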

  • Best solution I've seen so far. The others usually are bitwise operations to get the UTF-8 byte sequences, but don't fully implement the encoding.

    By Anonymous bucabay, on Mon Jun 08, 02:02:00 PM CEST  


http://ecmanaut.blogspot.com/2006/07/encoding-decoding-utf8-in-javascript.html