Unicode, URL, Mojibake

I make a request of our servers using the URL 'class' like this...

CODE
url.setRequestHeader("Accept-Charset", "utf-8");
url.fetchAsync(function (url2) {
    log("jjj TVL4.db.get_connections_internal url.response: " + url.response);
    log("jjj TVL4.db.get_connections_internal url.response headers:\n" +
        "Content-Type: " + url2.getResponseHeaders("Content-Type") + "\n" +
        "Vary: " + url2.getResponseHeaders("Vary"));
    // ...
    temp = JSON.parse(url.result);
    Array.prototype.map.call(temp, function (c) {
        log("jjj name: " + c.name);
    });
});


Notice I specify UTF-8 in Accept-Charset. Also, from my debug logs, it seems the server is replying with the UTF-8 character set.

CODE
WR 00:00:23:958: [T:3026] FINISHED 4.95 KB IN 0.14 s (36.64 KB/s) [http://api.splat.tv/v03/connections.jsp?lod=all&episode=848]
WM 00:00:23:958: [T:3026] jjj TVL4.db.get_connections_internal url.response: 200
WM 00:00:23:959: [T:3026] jjj TVL4.db.get_connections_internal url.response headers:
Content-Type: text/html;charset=UTF-8
Vary: Accept-Encoding
WM 00:00:23:959: [T:3026] jjj name: A Sleep Be Told
WM 00:00:23:959: [T:3026] jjj name: EnquÃªte sur un citoyen au-dessus de tout soupÃ§on
WM 00:00:23:960: [T:3026] jjj name: Faces In The Dark
WM 00:00:23:960: [T:3026] jjj name: Living a Lie


Nevertheless, the string I get in JavaScript has been parsed from the byte stream as 8859-1 encoded text, not UTF-8. As a result, instead of ê (e with a ^ accent) I am getting Ãª (<A with ~><superscript underlined a>). This is the expected result of parsing a UTF-8 stream as 8859-1. The byte stream is C3 AA, which is the two 8859-1 characters Ãª (<A with ~><superscript underlined a>), and also the single UTF-8 character ê (e with a ^ accent).
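
To make the byte-level ambiguity concrete, here is a minimal sketch in plain JavaScript that decodes the same two bytes both ways. The helper names decodeLatin1 and decodeUtf8 are made up for illustration and are not part of the widget framework; the sketch assumes only standard ECMAScript built-ins.

```javascript
// The two bytes discussed above.
var bytes = [0xC3, 0xAA];

// ISO-8859-1 maps every byte to the code point with the same value,
// so decoding is one character per byte.
function decodeLatin1(byteArray) {
    return String.fromCharCode.apply(null, byteArray);
}

// UTF-8: percent-encode each byte and let decodeURIComponent,
// which interprets percent-escapes as UTF-8, do the decoding.
function decodeUtf8(byteArray) {
    var escaped = byteArray.map(function (b) {
        return "%" + (b < 16 ? "0" : "") + b.toString(16);
    }).join("");
    return decodeURIComponent(escaped);
}

var asLatin1 = decodeLatin1(bytes); // "Ãª" (two characters)
var asUtf8 = decodeUtf8(bytes);     // "ê"  (one character)
```

The same input yields a two-character string under 8859-1 and a one-character string under UTF-8, which is exactly the difference visible in the names above.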

How do I make the TV parse my server's reply as one encoded in UTF-8?

7 Replies
  • I just noticed I was mixing the use of url, bound in the outer function, and url2 bound in the inner function. Correcting the code to use only the inner url2 causes no change in observed behavior. Here is the current code:

    CODE
    url.setRequestHeader("Accept-Charset", "utf-8");
    url.fetchAsync(function (url2) {
        log("jjj TVL4.db.get_connections_internal url.response: " + url2.response);
        log("jjj TVL4.db.get_connections_internal url.response headers:\n" +
            "Content-Type: " + url2.getResponseHeaders("Content-Type") + "\n" +
            "Vary: " + url2.getResponseHeaders("Vary"));
        if (! (200 === url2.response)) {
            TVL4.db.connection_type_filter = [];
            TVL4.db.connections = [];
            TVL4.pushView("view-ServerTrouble");
            return;
        }
        temp = JSON.parse(url2.result);
        if (temp.connections) {
            temp = temp.connections;
        }
        if (! (TVL4.is_array(temp))) {
            TVL4.db.connection_type_filter = [];
            TVL4.db.connections = [];
            TVL4.pushView("view-ServerTrouble");
            return;
        }
        Array.prototype.map.call(temp, function (c) {
            log("jjj name: " + c.name);
        });
    });


    I included all the error handling code this time in case it was somehow spoiling the parsing.

    Thanks! Jay
  • I believe the "Accept-Charset" value might be case sensitive, or the server might expect a slight variation of it, as seen in a few browsers.

    Try this:
    CODE
    url.setRequestHeader("Accept-Charset", "UTF-8");

    OR:
    CODE
    url.setRequestHeader("Accept-Charset", "utf8");
    url.setRequestHeader("Accept-Charset", "UTF8");
  • I tried all four combinations of {UTF, utf}x{-8, 8} and the symptoms are the same in all four cases. The TV is parsing the byte stream as if it were 8859-1 encoded text rather than UTF-8 as specified by the server.

    Any other ideas how to get the Yahoo! code to parse the response from the server as UTF-8 as the server specifies?
  • I updated the code to provide slightly more insight into what's happening in the debug log.

    Source:
    CODE
    url.setRequestHeader("Accept-Charset", "utf8");
    url.fetchAsync(function (url2) {
        log("jjj TVL4.db.get_connections_internal url.response: " + url2.response);
        log("jjj TVL4.db.get_connections_internal url.response headers:\n" +
            "Content-Type: " + url2.getResponseHeaders("Content-Type") + "\n" +
            "Vary: " + url2.getResponseHeaders("Vary"));
        if (! (200 === url2.response)) {
            TVL4.db.get_connections_internal_failed();
            return;
        }
        // Log a slice of the raw response text, before any JSON parsing.
        (function () {
            var start_result_index = url2.result.indexOf('"connectionId": 2730,');
            if (-1 < start_result_index) {
                log("jjj TVL4.db.get_connections_internal url2.result: '" +
                    url2.result.substr(start_result_index, 150) + "'");
            }
        })();
        temp = JSON.parse(url2.result);
        if (temp.connections) {
            temp = temp.connections;
        }
        if (! (TVL4.is_array(temp))) {
            TVL4.db.get_connections_internal_failed();
            return;
        }
        Array.prototype.map.call(temp, function (c) {
            log("jjj name: " + c.name);
        });
    });


    And in the logs we see the problem is already present in url2.result before parsing the result as JSON.
    CODE
    WR 00:00:21:455: [T:3643] FINISHED 4.95 KB IN 0.14 s (36.64 KB/s) [http://api.splat.tv/v03/connections.jsp?lod=all&episode=848]
    WM 00:00:21:455: [T:3643] jjj TVL4.db.get_connections_internal url.response: 200
    WM 00:00:21:455: [T:3643] jjj TVL4.db.get_connections_internal url.response headers:
    Content-Type: text/html;charset=UTF-8
    Vary: Accept-Encoding
    WM 00:00:21:456: [T:3643] jjj TVL4.db.get_connections_internal url2.result: '"connectionId": 2730,
    "name" : "EnquÃªte sur un citoyen au-dessus de tout soupÃ§on",
    "title" : "EnquÃªte sur un citoyen au-dessus de tout soupÃ§on",
    "'
    WM 00:00:21:456: [T:3643] jjj name: A Sleep Be Told
    WM 00:00:21:456: [T:3643] jjj name: EnquÃªte sur un citoyen au-dessus de tout soupÃ§on
    WM 00:00:21:456: [T:3643] jjj name: Faces In The Dark
    WM 00:00:21:457: [T:3643] jjj name: Living a Lie


    Is there any API documented or otherwise to make the URL 'class' parse the result as UTF-8 rather than as 8859-1 as it seems to be doing now?
  • One other note on this issue. The same data is fetched from our servers by a few different web pages/applications using Perl, Ruby, Java and JavaScript and is displayed correctly in all cases. I even used the SQL command line to verify that the text string is in a database table with UTF-8 encoding, and used wget to verify that the byte stream delivered by the server seems to be in UTF-8 format. I suppose some other subtle weird effect might be taking place, but right now all signs point to C3 AA being (the interesting part of) the byte stream delivered to the YCTV widget, and to the response carrying the correct headers to indicate that the byte stream should be interpreted as UTF-8 encoded Unicode text.

    I am at a loss, and the only workaround I see is to wrap the text in a layer of encoding like base64. The only client of our web services that would need this extra layer of encoding is also the most underpowered client, the Yahoo! Connected TV. Please tell me I am doing something wrong, and what other API I can call to fix the problem.

    Thanks! Jay
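
A lighter client-side workaround than base64 might be to re-decode the misread string inside the widget. The sketch below is an assumption-laden illustration, not a documented YCTV API: repairMisdecodedUtf8 is a hypothetical helper, and it relies only on the standard decodeURIComponent built-in. It treats each character as the byte it came from and decodes those bytes as UTF-8, leaving the text untouched whenever that is not possible.

```javascript
// Hypothetical helper: undo a "UTF-8 bytes read as ISO-8859-1" misdecoding.
function repairMisdecodedUtf8(text) {
    var escaped = "";
    var i, code;
    for (i = 0; i < text.length; i += 1) {
        code = text.charCodeAt(i);
        if (code > 0xFF) {
            // Not a single-byte value, so this was not misread 8859-1 text.
            return text;
        }
        escaped += "%" + (code < 16 ? "0" : "") + code.toString(16);
    }
    try {
        // decodeURIComponent interprets the percent-escaped bytes as UTF-8.
        return decodeURIComponent(escaped);
    } catch (e) {
        // The bytes are not valid UTF-8; keep the original text.
        return text;
    }
}

var repaired = repairMisdecodedUtf8("Enqu\u00C3\u00AAte"); // "Enquête"
```

This only treats the symptom. If the text is damaged before it reaches the client, the real fix belongs wherever the damage happens.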
  • QUOTE (jsl4tv @ Oct 21 2010, 07:11 AM)

    Hi Jay,
    We've looked into this and have determined that the data has been transcoded twice. It looks as though it was stored as UTF-8, but then at some point it was mistranscoded from ISO-8859-1 into UTF-8, which results in the mojibake.

    Load this into a browser and observe the mojibake: http://api.splat.tv/v03/connections.jsp?lo...amp;episode=848

    One possible scenario is that your database contains valid UTF-8, but when you select data from the database, the character encoding is either (incorrectly) explicitly set to ISO-8859-1, or it is by default treated as if it were ISO-8859-1, and it is being transcoded to UTF-8.

    - Ben
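
The double-transcoding scenario described in this reply can be simulated in a few lines of plain JavaScript. This is only a sketch of the general mechanism, not a reconstruction of the actual server code; utf8Bytes, latin1Decode and toHex are illustrative names, and the only built-ins assumed are standard ECMAScript ones.

```javascript
// UTF-8 encode a string into an array of byte values (0..255).
// encodeURIComponent percent-encodes the UTF-8 form of non-ASCII characters.
function utf8Bytes(text) {
    var bytes = [];
    encodeURIComponent(text).replace(/%([0-9A-F]{2})|./g, function (match, hex) {
        bytes.push(hex ? parseInt(hex, 16) : match.charCodeAt(0));
        return "";
    });
    return bytes;
}

// Misread bytes as ISO-8859-1: one character per byte.
function latin1Decode(bytes) {
    return String.fromCharCode.apply(null, bytes);
}

// Format bytes as space-separated upper-case hex for logging.
function toHex(bytes) {
    return bytes.map(function (b) {
        return (b < 16 ? "0" : "") + b.toString(16).toUpperCase();
    }).join(" ");
}

var original = "\u00EA";            // "ê"
var stored = utf8Bytes(original);   // valid UTF-8 in the database: C3 AA
var misread = latin1Decode(stored); // treated as ISO-8859-1: "Ãª"
var sent = utf8Bytes(misread);      // transcoded to UTF-8 again: C3 83 C2 AA
```

A client that correctly decodes the final bytes as UTF-8 would then display "Ãª", even though every individual step claimed to be producing UTF-8.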
  • I compared the logs of fetching that url today to the logs from a few days ago when I started debugging this issue. It seems that you are right, the correct Unicode has gone through some conversion to UTF-8 twice. Perhaps 8859-1 -> utf-8, and then the correct utf-8 again converted as if it were 8859-1. I am embarrassed, but happy to know the problem is in our code, and that we can fix it there.

    Thank you very much for looking into the problem!

    Jay
