YQL: JSON output format throws away text nodes that XML output does not

When retrieving HTML with "select * from html" in YQL, if the selected node contains multiple textNodes as direct children, only the last textNode is returned as the value of "content" in the JSON output, although all of the child textNodes are returned in the XML output.

Test case:

http://kryogenix.org/random/yql-demo-2.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>yql demo</title>
</head>
<body>

<p>Text which is all included</p>

<p>
Text which is not included.
<a>inner link: 1</a>

Text which is not included either.
<a>inner link: 2</a>
Text which is included (final text node).
</p>

</body>
</html>


XML output from " select * from html where url="http://kryogenix.org/random/yql-demo-2.html" and xpath='//p' ":

<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="2" yahoo:created="2009-01-19T08:55:38Z" yahoo:lang="en-US" yahoo:updated="2009-01-19T08:55:38Z" yahoo:uri="http://query.yahooapis.com/v1/yql?q=select+*+from+html+where+url%3D%22http%3A%2F%2Fkryogenix.org%2Frandom%2Fyql-demo-2.html%22+and+xpath%3D%27%2F%2Fp%27">
<diagnostics>
<publiclyCallable>true</publiclyCallable>
<url execution-time="347">http://kryogenix.org/random/yql-demo-2.html</url>
<user-time>350</user-time>
<service-time>347</service-time>
<build-version>2009.01.12.16:11</build-version>
</diagnostics>
<results>
<p>Text which is all included</p>
<p>Text which is not included. <a>inner link: 1</a> Text which is not included either. <a>inner link: 2</a> Text which is included (final text node).</p>
</results>
</query>


JSON output from same query:

cbfunc({
"query": {
"count": "2",
"created": "2009-01-19T08:56:08Z",
"lang": "en-US",
"updated": "2009-01-19T08:56:08Z",
"uri": "http://query.yahooapis.com/v1/yql?q=select+*+from+html+where+url%3D%22http%3A%2F%2Fkryogenix.org%2Frandom%2Fyql-demo-2.html%22+and+xpath%3D%27%2F%2Fp%27",
"diagnostics": {
"publiclyCallable": "true",
"url": {
"execution-time": "99",
"content": "http://kryogenix.org/random/yql-demo-2.html"
},
"user-time": "102",
"service-time": "99",
"build-version": "2009.01.12.16:11"
},
"results": {
"p": [
"Text which is all included",
{
"a": [
"inner link: 1",
"inner link: 2"
],
"content": " Text which is included\n(final text node)."
}
]
}
}
});


Observe that in the XML, "Text which is not included." and "Text which is not included either." both appear. In the JSON they do not; it would seem that the "content" value in the JSON output is overwritten each time a new textNode is encountered, so it ends up holding only the last one.
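
To make the suspected behaviour concrete, here is a minimal Python sketch of a converter that assigns to "content" for every text node, next to one that accumulates instead. It is only an illustration of the hypothesis above: the convert function and its accumulate flag are made up for this example and are not YQL's actual conversion code.

```python
import xml.etree.ElementTree as ET

# The second <p> from the XML output above.
SAMPLE = (
    "<p>Text which is not included. <a>inner link: 1</a> "
    "Text which is not included either. <a>inner link: 2</a> "
    "Text which is included (final text node).</p>"
)


def text_nodes(elem):
    """Return the direct child text nodes of elem, in document order."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t.strip() for t in nodes if t and t.strip()]


def convert(elem, accumulate):
    """Hypothetical XML-to-JSON step; not YQL's actual converter."""
    out = {}
    for child in elem:
        out.setdefault(child.tag, []).append(child.text)
    for text in text_nodes(elem):
        if accumulate and "content" in out:
            out["content"] += " " + text
        else:
            # Plain assignment: each new text node replaces the previous one.
            out["content"] = text
    return out


p = ET.fromstring(SAMPLE)
lossy = convert(p, accumulate=False)
fixed = convert(p, accumulate=True)
print(lossy["content"])
print(fixed["content"])
```

With accumulate=False only the final text node survives, which matches the shape of the JSON above; with accumulate=True all three are kept, although their position relative to the <a> elements is still lost.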

sil

4 Replies
  • In reply to sil (Jan 19 2009, 12:58 PM):

    Sil,

    Thanks for giving us the details. As Sam tweeted, it looks like a bug with the accumulation logic in the conversion - we'll get a fix out soon.

    Jonathan
  • In reply to Jonathan (Jan 20 2009, 07:37 AM):

    Looking at this a bit more, it's not a bug. We're giving some more thought to how we want to handle these cases. For now, if you need to preserve all the HTML sequencing and presentation logic, I'd recommend getting it as XML rather than JSON.

    Jonathan
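
For anyone taking the XML route, the following is a rough Python sketch of reading the <results> element shown earlier in the thread and walking each <p> in document order, so that text and inline elements stay interleaved. The mixed_content helper is a name invented for this example, and the XML is pasted in as a string rather than fetched from the service.

```python
import xml.etree.ElementTree as ET

# <results> block copied from the XML output earlier in the thread.
RESULTS = """<results>
<p>Text which is all included</p>
<p>Text which is not included. <a>inner link: 1</a> Text which is not included either. <a>inner link: 2</a> Text which is included (final text node).</p>
</results>"""


def mixed_content(elem):
    """Yield ("text", s) and ("element", tag, s) items in document order."""
    if elem.text and elem.text.strip():
        yield ("text", elem.text.strip())
    for child in elem:
        yield ("element", child.tag, (child.text or "").strip())
        # ElementTree stores text that follows a child in that child's tail.
        if child.tail and child.tail.strip():
            yield ("text", child.tail.strip())


paragraphs = [list(mixed_content(p)) for p in ET.fromstring(RESULTS).iter("p")]
for items in paragraphs:
    print(items)
```

If only the flattened text is needed, "".join(p.itertext()) on each <p> gives the full string, including the link text.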
  • In reply to Yqlblog (Jan 20 2009, 11:22 AM):

    Any progress on this issue?
  • In reply to Abhinay (Apr 19 2009, 06:07 AM):

    We have a method that would produce some fairly cumbersome-looking JSON to preserve the XML structure. It's workable, but we're also focusing on a number of new features, so it's a matter of working out when to spend development and QA time on it. As the XML does contain all the information you'd need, I'd recommend consuming YQL's XML output in this instance.

    Jonathan
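
The thread does not show what the structure-preserving JSON would look like. Purely as an illustration of why such output tends to be cumbersome, one possible shape is an ordered list of children in which strings are text nodes and objects are elements. The to_ordered function and the "tag"/"children" keys below are assumptions for this sketch, not YQL's format, and attributes are ignored for brevity.

```python
import json
import xml.etree.ElementTree as ET


def to_ordered(elem):
    """Encode elem as {"tag": ..., "children": [...]}, keeping node order."""
    children = []
    if elem.text and elem.text.strip():
        children.append(elem.text.strip())
    for child in elem:
        children.append(to_ordered(child))
        # Text following a child element lives in that child's tail.
        if child.tail and child.tail.strip():
            children.append(child.tail.strip())
    return {"tag": elem.tag, "children": children}


p = ET.fromstring(
    "<p>first <a>inner link: 1</a> second <a>inner link: 2</a> last</p>"
)
encoded = to_ordered(p)
print(json.dumps(encoded, indent=2))
```

Nothing is lost this way, but even a simple paragraph becomes a nested structure that is clumsier to consume than the flat "a" / "content" pair in the current JSON output.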