ecmanaut: Encoding / decoding UTF8 in javascript

2006-07-10

From time to time it has somewhat annoyed me that UTF8 (today's most common Unicode transport encoding, recommended by the IETF) conversion is not readily available in browser javascript. It really is, though, I realized today:

function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

2012 Update: Monsur Hossain took a moment to explain how and why this works. It's a good, in-depth post citing all standards in play so you need not bring a wizard's beard to know why it works. Executive summary: escape and unescape operate solely on octets, encoding/decoding %XXs only, while encodeURIComponent and decodeURIComponent encode/decode to/from UTF-8, and in addition encode/decode %XXs. The hack above combines both tools to cancel out all but the UTF-8 encoding/decoding parts, which happen inside the heavily optimized browser native code, instead of you pulling the weight in javascript.
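A minimal sketch of the two layers described above, traced for the single character "ä". The variable names are illustrative only and not from the original post:

```javascript
// Step-by-step view of what encode_utf8 / decode_utf8 do for "ä".
var original = "\u00E4";                        // "ä", a single code unit

// Encoding direction
var percent = encodeURIComponent(original);     // UTF-8 encode, then %XX-escape each octet
var octets = unescape(percent);                 // undo only the %XX layer, leaving raw octets

// Decoding direction
var repercent = escape(octets);                 // %XX-escape each octet again
var roundtrip = decodeURIComponent(repercent);  // undo %XX, then UTF-8 decode

console.log(percent);                 // "%C3%A4"
console.log(octets.length);           // 2, one character per UTF-8 octet
console.log(roundtrip === original);  // true
```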

Tested and working like a charm in these browsers:

Win32

Firefox 1.5.0.6
Firefox 1.5.0.4
Internet Explorer 6.0.2900.2180
Opera 9.0.8502

MacOS

Camino 2006061318 (1.0.2)
Firefox 1.5.0.4
Safari 2.0.4 (419.3)

Any modern standards-compliant browser should handle this code, though, so don't worry that it's a rather sparse test matrix. But feel free to use my test case: encoding and decoding the word "räksmörgås". That's Swedish for a shrimp sandwich, by the way -- very good subject matter indeed.
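One way that test case could be run is sketched here. The pass/fail reporting and the variable names are assumptions for illustration, not the original test page:

```javascript
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

// "räksmörgås", written with escapes so the file's own encoding cannot interfere
var word = "r\u00E4ksm\u00F6rg\u00E5s";
var encoded = encode_utf8(word);

// 10 characters, 3 of which need two UTF-8 octets each: 13 octets in total
var ok = encoded.length === 13 && decode_utf8(encoded) === word;
console.log(ok ? "success" : "failure");
```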

And if you hand me your platform/browser combination and its success/failure status for the tests, I'll try to update the post accordingly.

28 comments:

Anonymous, Wed Aug 02, 04:29:00 PM PDT
Nifty!

Alex Iskold, Mon Aug 14, 03:22:00 PM PDT
How about the case when the original text is encoded using ISO-8859-1?

Thanks!

Alex

Johan Sundström, Mon Aug 14, 03:55:00 PM PDT
You are in luck! Transforming text in ISO 8859-1 to Unicode is the identity transform (as in no change at all), as the code points they share have the same meaning in both encodings. For all other encodings (save US-ASCII, itself a subset of ISO 8859-1), you need to resort to laborious replace() hacks.
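A small sketch of that identity transform, assuming the ISO 8859-1 input arrives as an array of byte values (the array and variable names here are made up for illustration):

```javascript
// "räk" as ISO 8859-1 byte values; each byte value is also the Unicode code point.
var latin1Bytes = [0x72, 0xE4, 0x6B];

// No lookup table needed: the byte value is used directly as the character code.
var text = String.fromCharCode.apply(null, latin1Bytes);

console.log(text === "r\u00E4k");  // true
```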

Anonymous, Thu Feb 22, 08:30:00 AM PST
Best solution, thanks.

Anonymous, Tue Mar 06, 04:48:00 AM PST
Worked like a charm with:
Linux Firefox/2.0.0.2 (Ubuntu-edgy)

Riši, Sat Oct 27, 03:51:00 AM PDT
Thanks Johan.
However, on my machine, this does not quite work: the word got encoded as
rÃ¤ksmÃ¶rgÃ¥s

Win32, Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8

Any idea why?

Johan Sundström, Sat Oct 27, 06:34:00 AM PDT
What did you expect? That is how the UTF8 encoded text is represented when the undecoded UTF8 message is seen as a normal eight-bit string, when your character set is ISO-8859-1, commonly referred to as Latin-1.

Riši, Sat Oct 27, 07:30:00 AM PDT
Got it, thanks.

Anonymous, Tue Feb 26, 03:55:00 AM PST
Hi Johan

Thanks for the great little UTF-8 hack :) I've used it in my open source tool Hackvertor:
http://www.businessinfo.co.uk/labs/hackvertor/hackvertor.php

btw this isn't comment spam, Johan gave me permission to plug my tool :)

Anonymous, Sat Mar 15, 01:10:00 PM PDT
Just one question... does it matter where you put that code? I'm completely new to javascript, so I don't have any idea.

Johan Sundström, Sun Mar 16, 01:22:00 AM PDT
Put it between a <script> and </script> tag and you'll be fine.

scott, Sun Sep 07, 12:47:00 PM PDT
Excellent stuff. I wanted to note that:
decode_utf8()
can throw an error. It is probably a good idea to wrap the call in a try...catch

I think I was getting this error when trying to decode UTF-8 when it was really ISO 8859.

Johan Sundström, Sun Sep 07, 01:35:00 PM PDT
Same principles as in all programming apply: don't decode UTF-8 that isn't; don't divide by zero, and so on. Just adding a try/catch will hide errors in input and is inadvisable.
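To illustrate the failure discussed in this exchange, here is a sketch that feeds ISO 8859-1 text to decode_utf8. The catch block exists only to record which error was raised, so the bad input gets reported rather than hidden; the variable names are illustrative:

```javascript
function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

// "räksmörgås" as ISO 8859-1 octets: a lone 0xE4 octet is not valid UTF-8.
var latin1 = "r\xE4ksm\xF6rg\xE5s";

var errorName = null;
try {
  decode_utf8(latin1);
} catch (e) {
  errorName = e.name;  // recorded so the failure can be reported, not swallowed
}

console.log(errorName);  // "URIError"
```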

Anonymous, Thu Mar 12, 07:20:00 AM PDT
Hi, I believe that this method doesn't work for characters like the euro symbol €. In Firefox I get a "malformed URI sequence" error.

Johan Sundström, Thu Mar 12, 01:11:00 PM PDT
In that case you are doing it wrong; unescape(encodeURIComponent("€")) === "\xE2\x82\xAC" and decodeURIComponent(escape("\xE2\x82\xAC")) === "€" both return true, as they are supposed to.

Anonymous, Sat May 02, 12:16:00 AM PDT
internet explorer 8:
Win32, Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

Anonymous, Sat May 02, 05:28:00 PM PDT
Result of your platform/browser:
Win32, Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8 (.NET CLR 3.5.30729)

My $_SERVER['HTTP_USER_AGENT']:
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

Remarks:
I am running Internet Explorer version 8.0, which is operating in some sort of compatibility mode, and suddenly the executed javascript makes ajax requests with xmlHttp=new XMLHttpRequest(), where before (IE 6 and 7) it would execute xmlHttp=new ActiveXObject("Msxml2.XMLHTTP") or xmlHttp=new ActiveXObject("Microsoft.XMLHTTP").

take care!

Conny, Wed Jun 03, 11:06:00 PM PDT
How about uppercase letters?

For me your example rÃ¤ksmÃ¶rgÃ¥s is correctly decoded into räksmörgås but RÃ„KSMÃ–RGÃ…S gives a "malformed URI sequence" error.

Johan Sundström, Thu Jun 04, 12:10:00 PM PDT
That is because the UTF-8 encoding of RÄKSMÖRGÅS is the octet sequence R\xC3\x84KSM\xC3\x96RG\xC3\x85S, not RÃ„KSMÃ–RGÃ…S. (Any incorrectly encoded input sequence will probably give you that error or a similar one.)

Conny, Fri Jun 05, 08:19:00 AM PDT
Hmm, well... If RÄKSMÖRGÅS is encoded to utf8 with Javascript, it gets R\xC3\x84KSM\xC3\x96RG\xC3\x85S. But if the same word is encoded to utf8 with Java or the UltraEdit text editor (ASCII to UTF-8), it gets RÃ„KSMÃ–RGÃ…S.

I am parsing an XML-document encoded as utf8 (the Java-version...) in Javascript.

So, are there two versions of utf8?

Johan Sundström, Fri Jun 05, 12:07:00 PM PDT
Well, somewhere between your UTF-8 encoders and this blog, something is going wrong at least, because „ and – and … (code points 8222, 8211 and 8230 respectively, all well beyond 255) are not 8-bit characters, which every octet in a valid UTF-8 string must be, by definition.

Maybe you are sitting on some Windows system doing fancy quotes under your feet, or similar muck. Best of luck with your debugging.
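If the culprit is Windows-1252 (where octets 0x84, 0x96 and 0x85 display as „, – and …), a sketch along these lines could map those characters back to octets before decoding. The helper undo_cp1252 and its table are hypothetical and deliberately partial, covering only the three characters seen in this thread:

```javascript
function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

// Partial, illustrative table: Windows-1252 characters back to their octets.
var CP1252_TO_OCTET = {
  "\u201E": "\x84",  // „ (code point 8222)
  "\u2013": "\x96",  // – (code point 8211)
  "\u2026": "\x85"   // … (code point 8230)
};

// Hypothetical helper: restores raw octets so the string is valid UTF-8 again.
function undo_cp1252(s) {
  return s.replace(/[\u201E\u2013\u2026]/g, function (c) {
    return CP1252_TO_OCTET[c];
  });
}

// "RÃ„KSMÃ–RGÃ…S" as it appeared in the comment above
var garbled = "R\u00C3\u201EKSM\u00C3\u2013RG\u00C3\u2026S";
var repaired = decode_utf8(undo_cp1252(garbled));

console.log(repaired === "R\u00C4KSM\u00D6RG\u00C5S");  // true
```

A complete version would need the rest of the Windows-1252 0x80-0x9F range, and fixing the encoding at its source is likely the better cure.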

bucabay, Mon Jun 08, 05:02:00 AM PDT
Best solution I've seen so far. The others usually are bitwise operations to get the UTF-8 byte sequences, but don't fully implement the encoding.

Unknown, Fri Feb 05, 05:58:00 PM PST
Very good! Thank you very much. I've been having problems with a facebook connect site where the facebook stream.publish api method was not recognizing the accented characters being sent through a javascript function. I couldn't find any suggestions anywhere until I came across this blog. I applied your suggestion and voilà! problem resolved!

Unknown, Thu Mar 18, 12:11:00 AM PDT
For completeness, it would be nice if you extended the article with a bit of code highlighting that with this, you can create safe urls as well, by percent encoding all the utf8 bits:

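function percent_encode( s )
{
  // declared with var so that utf does not leak into the global scope
  var utf = encode_utf8( s );
  var enc = '';
  for( var i = 0; i < utf.length; i++ )
  {
    // zero-pad so octets below 0x10 become e.g. %0A rather than %a
    var hex = utf.charCodeAt( i ).toString( 16 ).toUpperCase();
    enc += '%' + ( hex.length < 2 ? '0' : '' ) + hex;
  }
  return enc;
}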

Simon, Wed Apr 14, 09:22:00 AM PDT
Still, in 2010, this is the only usable search result on solving this specific problem. Has nobody else noticed and written about this?

Also an UPDATE: Works in Chrome --
Win32, Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.1.249.1045 Safari/532.5

Alice, Tue Jul 13, 10:10:00 PM PDT
Thanks for sharing

Paul, Thu Aug 05, 01:47:00 AM PDT
Thanks!

We had been sending UTF-8 data through jQuery which was dealing with it fine in all browsers, until we switched to running our app in a Webkit component embedded in a PyQt application. Still sending UTF-8 from Python but it goes through an implicit "eval()" call, and I was ending up with Â£ (capital A, circumflex accent, pound sterling symbol) instead of £ (pound sterling symbol).

Popping the UTF-8 strings through the above has fixed this.

This is the QtWebkit 4.6.2.0

Win32, Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB) AppleWebKit/532.4 (KHTML, like Gecko) Qt/4.6.2 Safari/532.4

Thanks again

shdanfo, Fri Aug 06, 01:55:00 PM PDT
Thanks for your note on decodeURIComponent. I'm dealing with an XSLT template that includes something like the following text:

<a href="javascript:alert('Désolé')">XXX</a>

Now, XSLT is required to produce UTF-8 for Désolé because it's in an anchor, so the alert function then gets an unescaped version of this. My final solution after reading your note: call the following function (LOL) instead:

function alertDecodeURI(text) {
alert(decodeURIComponent(escape(text))); // escape to get back URI encoding
}