When retrieving HTML with "select * from html" in YQL, if the selected node contains multiple textNodes as direct children, only the last textNode is returned as the value of "content" in the JSON output, although all the child textNodes are returned in the XML output.
Test case:
http://kryogenix.org/random/yql-demo-2.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>yql demo</title>
</head>
<body>
<p>Text which is all included</p>
<p>
Text which is not included.
<a>inner link: 1</a>
Text which is not included either.
<a>inner link: 2</a>
Text which is included (final text node).
</p>
</body>
</html>
XML output from the query select * from html where url="http://kryogenix.org/random/yql-demo-2.html" and xpath='//p':
<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="2" yahoo:created="2009-01-19T08:55:38Z" yahoo:lang="en-US" yahoo:updated="2009-01-19T08:55:38Z" yahoo:uri="http://query.yahooapis.com/v1/yql?q=select+*+from+html+where+url%3D%22http%3A%2F%2Fkryogenix.org%2Frandom%2Fyql-demo-2.html%22+and+xpath%3D%27%2F%2Fp%27">
<diagnostics>
<publiclyCallable>true</publiclyCallable>
<url execution-time="347">http://kryogenix.org/random/yql-demo-2.html</url>
<user-time>350</user-time>
<service-time>347</service-time>
<build-version>2009.01.12.16:11</build-version>
</diagnostics>
<results>
<p>Text which is all included</p>
<p>Text which is not included. <a>inner link: 1</a> Text which is not included either. <a>inner link: 2</a> Text which is included (final text node).</p>
</results>
</query>
JSON output from the same query:
cbfunc({
"query": {
"count": "2",
"created": "2009-01-19T08:56:08Z",
"lang": "en-US",
"updated": "2009-01-19T08:56:08Z",
"uri": "http://query.yahooapis.com/v1/yql?q=select+*+from+html+where+url%3D%22http%3A%2F%2Fkryogenix.org%2Frandom%2Fyql-demo-2.html%22+and+xpath%3D%27%2F%2Fp%27",
"diagnostics": {
"publiclyCallable": "true",
"url": {
"execution-time": "99",
"content": "http://kryogenix.org/random/yql-demo-2.html"
},
"user-time": "102",
"service-time": "99",
"build-version": "2009.01.12.16:11"
},
"results": {
"p": [
"Text which is all included",
{
"a": [
"inner link: 1",
"inner link: 2"
],
"content": " Text which is included\n(final text node)."
}
]
}
}
});
Observe that in the XML, "Text which is not included." and "Text which is not included either." both appear. In the JSON they do not. It would seem that the value of "content" in the JSON output is overwritten each time a new textNode is encountered, so it ends up holding only the last one.
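The suspected behaviour can be reproduced outside YQL. The following is only a sketch of the hypothesis, not YQL's actual converter, whose source is not available here: the function names to_json_overwriting and to_json_accumulating are made up for illustration, and the input is the second <p> from the XML output above.

```python
import xml.etree.ElementTree as ET

# The second <p> element, as it appears in the XML output of the query.
SAMPLE = (
    "<p>Text which is not included. <a>inner link: 1</a> "
    "Text which is not included either. <a>inner link: 2</a> "
    "Text which is included (final text node).</p>"
)


def text_nodes(elem):
    """Return the non-blank direct child text nodes of elem, in document order."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t and t.strip()]


def child_elements(elem):
    """Group child element text by tag name, as the JSON output does for "a"."""
    out = {}
    for child in elem:
        out.setdefault(child.tag, []).append(child.text)
    return out


def to_json_overwriting(elem):
    """Hypothetical buggy conversion: every text node replaces "content"."""
    out = child_elements(elem)
    for text in text_nodes(elem):
        out["content"] = text  # the previous text node is lost here
    return out


def to_json_accumulating(elem):
    """One possible fix (an assumption): concatenate all text nodes."""
    out = child_elements(elem)
    texts = text_nodes(elem)
    if texts:
        out["content"] = "".join(texts)
    return out


if __name__ == "__main__":
    p = ET.fromstring(SAMPLE)
    print(to_json_overwriting(p))   # "content" holds only the final text node
    print(to_json_accumulating(p))  # "content" holds all three text nodes
```

With the overwriting version, "content" ends up as " Text which is included (final text node)." and the "a" array is intact, which matches the shape of the JSON above. Concatenation is just one way to keep all the text; it still loses the position of the text relative to the <a> elements.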