function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

2012 Update: Monsur Hossain took a moment to explain how and why this works. It's a good, in-depth post citing all standards in play so you need not bring a wizard's beard to know why it works. Executive summary: escape and unescape operate solely on octets, encoding/decoding %XXs only, while encodeURIComponent and decodeURIComponent encode/decode to/from UTF-8, and in addition encode/decode %XXs. The hack above combines both tools to cancel out all but the UTF-8 encoding/decoding parts, which happen inside the heavily optimized browser native code, instead of you pulling the weight in javascript.
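To make the cancellation concrete, here is a small sketch that traces one character through each step. The sample character and the variable names are only for illustration and are not part of the original post.

```javascript
// Step-by-step trace of encode_utf8 for "å" (sample character chosen arbitrarily).
var original = "\u00E5"; // "å", a single character

// encodeURIComponent converts to UTF-8 and percent-encodes each octet.
var percentEncoded = encodeURIComponent(original); // "%C3%A5"

// unescape undoes only the percent-encoding, leaving one character per octet.
var utf8Octets = unescape(percentEncoded); // "\xC3\xA5"

// The reverse direction: escape re-adds the %XXs, decodeURIComponent decodes the UTF-8.
var roundTrip = decodeURIComponent(escape(utf8Octets)); // "å" again
```

The intermediate percent-encoded form appears in both directions, which is why the two calls cancel out everything except the UTF-8 conversion.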
Tested and working like a charm in these browsers:
- Win32
  - Firefox 1.5.0.6
  - Firefox 1.5.0.4
  - Internet Explorer 6.0.2900.2180
  - Opera 9.0.8502
- MacOS
  - Camino 2006061318 (1.0.2)
  - Firefox 1.5.0.4
  - Safari 2.0.4 (419.3)
And if you hand me your platform/browser combination and its success/failure status for the tests, I'll try to update the post accordingly.
Nifty!
How about the case when the original text is encoded using ISO-8859-1?
Thanks!
Alex
You are in luck! Transforming text in ISO 8859-1 to Unicode is the identity transform (as in no change at all), as the code points they share have the same meaning in both encodings. For all other encodings (save US ASCII, which is a subset of ISO 8859-1), you need to resort to laborious replace() hacks.
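A sketch of what the identity transform means in practice, assuming you hold the ISO 8859-1 data as an array of byte values. The function name latin1_to_string is hypothetical; the point is only that no lookup table is needed.

```javascript
// Hypothetical helper: turn ISO 8859-1 byte values into a JavaScript string.
// Byte value N maps directly to the Unicode code point with the same number.
function latin1_to_string(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i] & 0xFF);
  }
  return out;
}

// "räk" in ISO 8859-1 is the bytes 0x72 0xE4 0x6B.
var word = latin1_to_string([0x72, 0xE4, 0x6B]); // "r\u00E4k"
```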
Best solution, thanks.
Worked like a charm with:
Linux Firefox/2.0.0.2 (Ubuntu-edgy)
Thanks Johan.
However, on my machine, this does not quite work: the word got encoded as
räksmörgås
Win32, Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8
Any idea why?
What did you expect? That is how the UTF8 encoded text is represented when the undecoded UTF8 message is seen as a normal eight-bit string, when your character set is ISO-8859-1, commonly referred to as Latin-1.
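The following sketch reproduces that result. The word is written with escapes so that the encoding of the script file itself does not matter; the variable names are illustrative only.

```javascript
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

// "räksmörgås", spelled with escapes.
var word = "r\u00E4ksm\u00F6rg\u00E5s";
var encoded = encode_utf8(word);

// Each non-ASCII letter became two octets: 10 characters in, 13 out.
// Read as Latin-1, the octet pairs C3 A4, C3 B6 and C3 A5 display as
// "ä", "ö" and "Ã¥", which is exactly the string reported above.
var sameAsReported = encoded === "r\u00C3\u00A4ksm\u00C3\u00B6rg\u00C3\u00A5s"; // true
```

In other words, the output is correct UTF-8; it only looks wrong when the octets are displayed as if each one were a Latin-1 character.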
Got it, thanks.
Hi Johan
Thanks for the great little UTF-8 hack :) I've used it in my open source tool Hackvertor:-
http://www.businessinfo.co.uk/labs/hackvertor/hackvertor.php
btw this isn't comment spam, Johan gave me permission to plug my tool :)
Just one question... does it matter where you put that code? I'm completely new to javascript, so I don't have any idea.
Put it between a <script> and </script> tag and you'll be fine.
Excellent stuff. I wanted to note that decode_utf8() can throw an error. It is probably a good idea to wrap the call in a try...catch.
I think I was getting this error when trying to decode UTF-8 when it was really ISO 8859.
Same principles as in all programming apply: don't decode UTF-8 that isn't; don't divide by zero, and so on. Just adding a try/catch will hide errors in input and is inadvisable.
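For reference, here is a sketch of the failure being discussed: decode_utf8 fed a string that is ISO 8859-1 rather than UTF-8. The try/catch here exists only to capture the error name for the demonstration; it is not a recommendation to swallow the error in real code.

```javascript
function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

// "\xE5" is "å" in ISO 8859-1, but a lone E5 octet is not valid UTF-8.
var errorName = null;
try {
  decode_utf8("\xE5");
} catch (e) {
  errorName = e.name; // "URIError"
}
```

The exact error message differs between browsers ("malformed URI sequence" in Firefox, for example), but the error type is URIError.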
Hi, I believe that this method doesn't work for characters like the EURO symbol €. In Firefox I get a "malformed URI sequence" error.
In that case you are doing it wrong; unescape(encodeURIComponent("€")) === "\xE2\x82\xAC" and decodeURIComponent(escape("\xE2\x82\xAC")) === "€" both return true, as they are supposed to.
ReplyDeleteinternet explorer 8:
Win32, Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
Result of your platform/browser:
Win32, Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8 (.NET CLR 3.5.30729)
My $_SERVER['HTTP_USER_AGENT']:
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
Remarks:
I am running Internet Explorer version 8.0, which is operating in some sort of compatibility mode, and suddenly the executed javascript makes its ajax requests as xmlHttp=new XMLHttpRequest(), whereas before (IE 6 and 7) it would execute xmlHttp=new ActiveXObject("Msxml2.XMLHTTP") or xmlHttp=new ActiveXObject("Microsoft.XMLHTTP").
take care!
How about uppercase letters?
For me your example räksmörgÃ¥s is correctly decoded into räksmörgås but RÄKSMÖRGÃ…S gives a "malformed URI sequence" error.
That is because the UTF-8 encoding of RÄKSMÖRGÅS is RÃKSMÃRGà S, not RÄKSMÖRGÃ…S. (Any incorrect encoded input sequence will probably give you that error or a similar one.)
Hmm, well... If RÄKSMÖRGÅS is encoded to utf8 with Javascript, it gets RÄKSMÖRGÃ…S. But if the same word is encoded to utf8 with Java or UltraEdit Text Editor (ASCII to UTF-8), it gets RÄKSMÖRGÃ…S.
I am parsing an XML-document encoded as utf8 (the Java-version...) in Javascript.
So, are there two versions of utf8?
Well, somewhere between your UTF-8 encoders and this blog, something is going wrong at least, because „ and – and … (code points 8222, 8211 and 8230 respectively, all way beyond 255) are not 8-bit characters, which every octet in a valid UTF-8 string must be, by definition.
Maybe you are sitting on some Windows system doing fancy quotes under your feet, or similar muck. Best of luck with your debugging.
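One way to check the "every octet must be an 8-bit character" point is a small validator. This is a sketch; is_octet_string is a made-up name. The mangled sample assumes one plausible cause, namely that the octets 0x84, 0x96 and 0x85 were re-read as Windows-1252 somewhere along the way, where they display as „, – and ….

```javascript
// Hypothetical helper: true if every character fits in one octet (0..255).
function is_octet_string(s) {
  for (var i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 255) return false;
  }
  return true;
}

function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

// "RÄKSMÖRGÅS" via escapes; its UTF-8 form contains the octets 84, 96 and 85.
var good = encode_utf8("R\u00C4KSM\u00D6RG\u00C5S"); // "R\xC3\x84KSM\xC3\x96RG\xC3\x85S"

// The same octets after a Windows-1252 reinterpretation (assumed cause):
// code points 8222, 8211 and 8230 now sit where 0x84, 0x96 and 0x85 were.
var mangled = "R\u00C3\u201EKSM\u00C3\u2013RG\u00C3\u2026S";

var goodOk = is_octet_string(good);       // true
var mangledOk = is_octet_string(mangled); // false
```

A check like this can be run on input before calling decode_utf8 to tell whether it was damaged in transit.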
Best solution I've seen so far. The others usually are bitwise operations to get the UTF-8 byte sequences, but don't fully implement the encoding.
Very good! Thank you very much. I've been having problems with a facebook connect site where the facebook stream.publish api method was not recognizing the accented characters being sent through a javascript function. I couldn't find any suggestions anywhere until I came across this blog. I applied your suggestion and voilà! problem resolved!
For completeness, it would be nice if you extended the article with a bit of code highlighting that with this, you can create safe urls as well, by percent encoding all the utf8 bits:
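function percent_encode( s )
{
    var utf = encode_utf8( s ); // one character per UTF-8 octet
    var enc = '';
    for( var i = 0; i < utf.length; i++ )
    {
        // Zero-pad so that octets below 0x10 still yield two hex digits
        var hex = utf.charCodeAt( i ).toString( 16 );
        enc += '%' + ( hex.length < 2 ? '0' : '' ) + hex;
    }
    return enc;
}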
Still now in 2010, this is the only usable search result on solving this specific problem. Has nobody else noticed and written about this?
Also an UPDATE: Works in Chrome --
Win32, Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.1.249.1045 Safari/532.5
Thanks for sharing
Thanks!
We had been sending UTF-8 data through jQuery which was dealing with it fine in all browsers, until we switched to running our app in a Webkit component embedded in a PyQt application. Still sending UTF-8 from Python but it goes through an implicit "eval()" call, and I was ending up with £ (capital A, circumflex accent, pound sterling symbol) instead of £ (pound sterling symbol).
Popping the UTF-8 strings through the above has fixed this.
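A minimal sketch of that fix, assuming the string arrives with one character per UTF-8 octet as described; the variable name is illustrative.

```javascript
function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

// "\u00C2\u00A3" is the two-octet UTF-8 form of the pound sign, displayed as "£".
var fixed = decode_utf8("\u00C2\u00A3"); // "\u00A3", i.e. "£"
```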
This is the QtWebkit 4.6.2.0
Win32, Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB) AppleWebKit/532.4 (KHTML, like Gecko) Qt/4.6.2 Safari/532.4
Thanks again
Thanks for your note on decodeURIComponent. I'm dealing with an XSLT template that includes something like following text
<a href="javascript:alert('Désolé')">XXX</a>
Now, XSLT is required to produce UTF-8 for Désolé because it's in an anchor, so the alert function then gets an unescaped version of this. My final solution after reading your note: call the following function (LOL) instead:
function alertDecodeURI(text) {
  alert(decodeURIComponent(escape(text))); // escape to get back URI encoding
}