In this Chapter
- Introduction to SELECT
- Syntax of SELECT
- Specifying the Elements Returned (Projection)
- Filtering Query Results (WHERE)
- Joining Tables With Sub-Selects
- Paging and Table Limits
- Sort and Other Functions
- Remote and Local Processing
- GUIDs, Social, and Me
- Variable Substitution in the GET Query String
- Extracting HTML
Introduction to SELECT¶
The SELECT statement of YQL retrieves data from YQL tables. The YQL Web Service fetches data from a back-end datasource (often a Web service), transforms the data, and returns the data in either XML or JSON format. Table rows are represented as repeating XML elements or JSON objects. Columns are XML sub-elements or attributes, or JSON name-value pairs. To try out some SELECT examples and to view the results, run the YQL Console and click the items under “Example Queries” or “Data Tables”.
Syntax of SELECT¶
The YQL SELECT statement has the following syntax:
SELECT what FROM table WHERE filter [| function]
The what clause contains the fields (columns) to retrieve. The fields correspond to the XML elements or JSON objects in the data returned by the SELECT. An asterisk (the “*” character) in the what clause means all fields. The table is either the YQL pre-defined or Open Data Table that represents a datasource. (Unlike in SQL, in YQL only one table can be specified.) The filter is a comparison expression that limits the rows returned. The results of the SELECT can be piped to an optional function, such as sort.
In YQL, statement keywords such as SELECT and WHERE are case-insensitive. Table and field names are case sensitive. In string comparisons, the values are case sensitive. String literals must be enclosed in quotes; either double or single quotes are allowed.
Specifying the Elements Returned (Projection)¶
To get a vertical slice (projection) of a table, specify the fields in the clause following the SELECT keyword. In YQL, these fields are analogous to the columns of a SQL table. Multiple fields are delimited by commas, for example:
select lastUpdated, itemurl from social.updates where guid=me
To get all fields, specify an asterisk:
select * from social.updates where guid=me
If the fields in the result set contain sub-fields, you can indicate the sub-fields by using periods (dots) as delimiters. (Sometimes this format is called “dot-style syntax.”) For example, for the social.profile table, to get only the imageUrl sub-field of the image field, enter the following:
select image.imageUrl from social.profile where guid=me
The following lines show part of the XML response for this SELECT. Note that only the imageUrl sub-field is returned.
... <results> <profile xmlns="http://social.yahooapis.com/v1/schema.rng"> <image> <imageUrl>http://l.yimg.com/us.yimg.com/i/identity/nopic_192.gif</imageUrl> </image> </profile> </results>
If you specify one or more non-existent fields in the what clause, the HTTP response code is 200 OK. If none of the fields in the what clause exist, the result set is empty. (That is, zero rows are returned.) Note that field names are case sensitive.
Filtering Query Results (WHERE)¶
The filter in the WHERE clause determines which rows are returned by the SELECT statement. The filter in the following statement, for example, returns rows only if the text field matches the string sunnyvale:
select * from geo.places where text=’sunnyvale’
Filters can also be used to return rows if a field is contained in a list of values. The filter in this statement returns rows if the the location field matches “San Francisco” or “San Jose”.
select * from upcoming.events where location in ("San Francisco", "San Jose")
With a remote filter, the filtering takes place in the back-end datasource (usually a Web service) called by the YQL Web Service.
A remote filter has the following syntaxes:
input_key IN [list of one or more literals]
The input key is a parameter that YQL passes to the back-end datasource. The literal is a value, either a string, integer, or float. The equality (=) operator and the IN operator are allowed in a remote filter. (A local filter, in contrast, can contain other types of comparison operators.)
For example, in the following statement, the input key is photo_id:
select * from flickr.photos.info where photo_id='2186714153'
For this SELECT statement, the YQL Web Service calls the Flickr Web Service, passing photo_id as follows:
Using the IN operator, the following statement returns the geo data for Sunnyvale, Mountain View, and Cupertino.
select * from geo.places where text IN ("Sunnyvale", "Mountain View", "Cupertino")
Most YQL tables require the SELECT statement to specify a remote filter, which requires an input key. Often, the input key is not one of the fields included in the results returned by a SELECT. To see which input keys are allowed or required, enter the DESC statement for the YQL table and note the key XML element of the results. For example, the results of DESC flickr.photos.info show that the input key photo_id is required:
<results> . . . <select> <key name="secret" type="xs:string"/> <key name="photo_id" required="true" type="xs:string"/> </select> . . . <results>
Multiple remote filters can be combined with the boolean AND or OR operators, for example:
select * from flickr.photos.info where photo_id='2186714153' or photo_id='3502889956'
The SELECT statements for some tables require multiple remote filters, for example:
select * from local.search where zip='94085' and query='pizza'
The YQL Web Service performs local filtering on the data it retrieves from the back-end datasource. Before examining the syntax of local filters, let’s look at a few examples.
In the following example, YQL gets data from the flickr.photos.interestingness table, then applies the local filter title='moon'.
select * from flickr.photos.interestingness where title='moon'
In the next statement, the local filter checks that the value of the title field starts with the string Chinese or CHINESE.
select * from flickr.photos.interestingness where title like 'Chinese%'
The filter in the following statement contains a regular expression that checks for the substring blue:
select * from flickr.photos.interestingness where title matches '.*blue.*'
The following statement returns recent photos with the IDs specified in the parentheses:
select * from flickr.photos.recent where id in ('3630791520', '3630791510', '3630791496')
A local filter has the following syntax:
field comparison_operator literal
The field (column) specifies the name of the XML element or JSON object in the results. To specify a sub-field, separate the containing fields with periods. For an example sub-field, see Rating.AverageRating in the SELECT statement in the Combining Boolean Operations table below. The literal is either a quoted string, an integer, or a float. The following table lists the allowed comparison operators.
|>=||Greater than or equal to.|
|<=||Less than or equal to.|
|[NOT] IN||Tests whether a value is contained in a set of values. This operator can be followed by either a sub-select or by a comma-delimited set of values within parentheses.|
|IS [NOT] NULL||Tests for the existence of the field in the results. An IS NULL expression is true if the field is not in the results.|
|[NOT] LIKE||Tests for a string pattern match. The comparison is case-insensitive. The “%” character in the literal indicates zero or more characters. For example, Sys% matches any string starting with Sys.|
|[NOT] MATCHES||Tests for a string pattern match, allowing regular expressions. The comparison is case sensitive.|
Combining Filter Expressions (AND, OR)¶
Local and remote filter expressions can be combined with the boolean AND and OR operators. The AND operator has precedence over the OR operator. To change precedence, enclose expressions in parentheses.
In the following example, the first two filters are remote expressions because query and location are input keys. The third filter, containing the field Rating.AverageRating, is a local filter.
select * from local.search where query="sushi" and location="san francisco, ca" and Rating.AverageRating="4.5"
Joining Tables With Sub-Selects¶
With sub-selects, you can join data across different YQL tables. (In SQL, a sub-select is usually called a “subquery.”) Because YQL tables are often backed by Web services, sub-selects enable you to join data from different Web services. In a join, the sub-select provides input for the IN operator of the outer select. The values in the outer select can be either input keys (remote filters) or fields in the response (local filters).
By using a sub-select, the following statement returns the profiles of all of the connections (friends) of the user currently logged in to Yahoo. This statement joins the social.profile and social.connection tables on the values of the GUIDs. The inner SELECT, which follows the word IN, returns the GUIDs for the user’s connections. For each of these GUIDs, the outer SELECT returns the profile information.
select * from social.profile where guid in (select guid from social.connections where owner_guid=me)
Tables can be joined on multiple keys. In the following example, the local.search and geo.places tables are joined on two keys. The inner select returns two data fields (centroid.latitude and centroid.latitude) which are compared with the two input keys (latitude and longitude) of the outer select.
select * from local.search where (latitude,longitude) in (select centroid.latitude, centroid.longitude from geo.places where text="north beach, san francisco") and radius=1 and query="pizza" and location=""
The next example shows an inner select that returns data from an RSS feed:
select * from search.web where query in (select title from rss where url="http://rss.news.yahoo.com/rss/topstories" | truncate(count=1))
One sub-select is allowed in each select. In other words, each select statement can only have one IN keyword, but the inner select may also have an IN keyword. The following statement is legal:
select name, description, ticket_url, url, venue_address, venue_state_code from upcoming.events where search_text in (SELECT artist.name FROM spotify.search.album where album in (select title from youtube.search where query = "Together" limit 5) limit 5)
However, the next statement is illegal because it has two IN keywords in a select:
ILLEGAL: select * from flickr.photos.search where lat in (select centroid.latitude from geo.places where text="sfo") and lon in (select centroid.longitude from geo.places where text="sfo")
Paging and Table Limits¶
Many YQL queries access datasources that contain thousands, or even millions, of items. When querying large datasources, applications need to page through the results to improve performance and usability. YQL enables applications to implement paging or limit table size at two levels: remote and local.
To find out how many items (rows) a query (SELECT) returns, in an XML response, check the value of the yahoo:count attribute of the query element. In a JSON response, check the value of the count object.
The maximum number of items returned by a SELECT is 5000. The maximum processing time for a YQL statement is 30 seconds. For most tables, the default number of items returned is 10. (That is, the default is 10 if you do not specify a limit in the SELECT statement.)
A remote limit controls the number of items (rows) that YQL retrieves from the back-end datasource. To specify a remote limit, enter the offset (start position) and number of items in parentheses after the table name.
For example, in the following statement, the offset is 0 and the number of items is 10. When this statement runs, YQL calls Yahoo Search BOSS (the back-end source for the search.web table) and gets the first 10 items that match the query="pizza" filter:
select title from search.web(0,10) where query="pizza"
The following statement gets items 10 through 30. In other words, starting at postition 10, it gets 20 items:
``select title from search.web(10,30) where query=”pizza” ``
The default offset is 0. For example, the following statement gets the first 20 items:
``select title from search.web(20) where query=”pizza” ``
The default number of items for a remote limit varies with the table. For most tables, the default number of items is 10.
The maximum number of items also varies with table. To get the maximum number of items, enter 0 in parentheses after the table name. The following statement returns 1000 items from the search.web table:
``select title from search.web(0) where query=”pizza” ``
A local limit controls the number of rows YQL returns to the calling application. YQL applies a local limit to the data set that it has retrieved from the back-end datasource. To specify a local limit, include the LIMIT and OFFSET keywords (each followed by an integer) after the WHERE clause. LIMIT specifies the number of rows and OFFSET indicates the starting position. The OFFSET keyword is optional. The default offset is 0, which is the first row.
The following statement has a remote limit of 100 and a local limit of 15. When this statement runs, YQL gets up to 100 items from the back-end datasource. On these items, YQL applies the local limit and offset. This statement returns 15 rows to the calling application, starting with the first row (offset 0).
select title from search.web(100) where query="pizza" limit 15 offset 0
YQL retrieves items from the back-end datasource one page at a time until either the local or remote limit has been reached. The page size varies with the table. The following statement has an unbounded remote limit (0) so YQL retrieves items from the backend datasource until the the local limit of 65 is reached:
select title from search.web(0) where query="pizza" limit 65
Typically, a SELECT statement includes limits and filters, as shown in Example With Remote and Local Steps.
Sort and Other Functions¶
YQL includes built-in functions such as sort, which are appended to the SELECT statement with the pipe symbol (“|”). These functions are applied to the result set after the SELECT statement performs all other operations, such as applying filters and limits.
In the following SELECT statement, the sub-select returns a list of GUIDs, and the outer select returns a set of profiles, one for each GUID. This set of profiles is piped to the sort function, which orders the results according to the value of the nickname field.
``select * from social.profile where guid in (select guid from social.connections where owner_guid=me) | sort(field=”nickname”) ``
Multiple functions can be chained together with the pipe symbol (“|”). The following statement queries the local.search table for restaurants serving pizza. The results are piped to the sort function, then to the reverse function. The final result contains 20 rows, sorted by rating from high to low.
``select Title, Rating.AverageRating from local.search(20) where query=”pizza” and city=”New York” and state=”NY” | sort(field=”Rating.AverageRating”) | reverse() ``
Results can also be sorted with multiple criteria. The sort function can take multiple fields with optional values specifying the sort direction. For example, the results returned from the above statement querying for restaurants serving pizza can be sorted first by the average rating (primary sort) and then by the title (secondary sort). In the modified version of the statement below, the results are first sorted by the average rating in descending order and then those restaurants with the same average rating are sorted by the title in ascending order.
select Title, Rating.AverageRating from local.search(20) where query="pizza" and city="New York" and state="NY" | sort(field="Rating.AverageRating", descending="true",field="Title", descending="false")
The following table lists the YQL functions that can be appended to a SELECT statement. Function arguments are specified as name-value pairs.
|sort||``field [descending] [hidden] ``||sort(field="nickname", descending="true", hidden="true" field="email")||Sorts the result set according to one or more criteria that are specified by field keys. The default value of the optional descending argument is false. If hidden is set to “true”, then the preceding field will be omitted from the items in the result set. Syntax: sort(field_1="value" [, descending_1="true"|"false"] [,hidden_1="true"|"false"], ... , field_n="value" [,descending_n="true"|"false"] [, hidden_n="true"|"false"])|
|tail||count||tail(count=4)||Gets the last count items (rows).|
|truncate||count||truncate(count=4)||Gets the first count items (rows).|
|reverse||(none)||reverse()||Reverses the order of the items (rows).|
|unique||field [hideRepeatCount]||unique(field="Rating.AverageRating", hideRepeatCount="true")||Removes items (rows) with duplicate values in the specified field (column). The first item with the value remains in the results. If hideRepeatCount is missing or set to "false", the item that remains in the results will be annotated with the element (XML) or field (JSON) yahoo:repeatcount specifying the number of items (including the returned one) with duplicate values in the specified field.|
|sanitize||[field]||sanitize(field='foo')||Sanitizes the output for HTML-safe rendering. To sanitize all returned fields, omit the field parameter.|
Remote and Local Processing¶
When YQL runs a SELECT statement, it accesses a back-end datasource, typically by calling a Web service. Remote filters and limits are implemented by the back-end Web service. Local processing (including local filters and limits) is performed by the YQL Web Service on the data it fetches from the back-end Web service. As shown by the following example, whether an operation is remote or local affects the data returned to the application that calls the SELECT statement.
Example With Remote and Local Steps¶
The following SELECT statement gets place information matching the string “San Francisco” from the geo.places table:
select lang, woeid, name, country.content from geo.places(150) where text="San Francisco" and lang in ("es") limit 10 | sort(field="country")
The steps that follow show the order in which YQL processes the remote and local parts of the SELECT statement:
YQL calls Yahoo GeoPlanet API, the Web service behind the geo.places table, at the following URL:
By calling this URL, YQL gets the first 50 items that match the SELECT statement’s remote filter: q(San Francisco). The q element is the text input key for the geo.places table. Although the remote filter in the SELECT is set to 100, to improve efficiency, YQL only fetches 50 items each time it calls the Web service.
On the 100 items it retrieved from the GeoPlanet API, YQL applies the local filter: lang in ("es"). This filter selects an item if the value lang field (column) is “es”. From the set of the 50 items, YQL finds a subset of items that match the local filter.
YQL checks that the items retrieved contain the lang, woeid, name, and country.content fields, which are specified after the SELECT keyword.
YQL calls BOSS again, incrementing the start parameter to 50, to get the next 50 rows:
On this second set of 50 items, YQL applies the local filter, and again only finds a subset of items matching the local filter.
On the second set of items from the GeoPlanet API, YQL applies the local filter and again verifies that the lang, woeid, name, and country.content fields, exist. Let’s suppose that YQL has already found 5 matches from the two sets of results. Because the local limit of 5 has been reached, YQL does not call the GeoPlanet API again.
YQL pipes the 5 items to the sort function, ordering the data by the country field.
YQL returns 5 rows (containing just the lang, woeid, name, and country.content fields) to the calling Web application.
Summary of Remote and Local Controls¶
The following table identifies whether an element in the SELECT statement is processed locally or remotely by YQL.
|Syntax Element in SELECT||Local or Remote||Section With Details|
|Columns or asterisk after the SELECT keyword.||Local||Specifying the Elements Returned (Projection)|
|Remote limit and offset, indicated by integers in parentheses after the table name.||Remote||Remote Limits|
|Remote filter expression in the WHERE clause. The allowed operators are the equal sign or IN. The value compared is an input key for the back-end datasource.||Remote||Remote Filters|
|Local filter expression in the WHERE clause. Various operators are allowed, including LIKE and MATCH. The value compared is a field (column) in the data returned by the query.||Local||Local Filters|
|LIMIT and OFFSET keywords after the WHERE clause.||Local||Local Limits|
|Sort and other functions after the pipe (||) symbol.||Local|
Variable Substitution in the GET Query String¶
If the URL contains @var literals, YQL replaces the literals with the values of query parameters with the same names. For example, suppose that the URL for the call to the YQL Web Service has the animal query parameter:
http://query.yahooapis.com/v1/yql?animal=dog&q=select * from sometable where animal=@animal
For this URL, YQL will run the following SELECT statement:
select * from sometable where animal="dog"
A key feature of YQL is the ability to access data from structured data feeds such as RSS and ATOM. However, if no such feed is available, you can use the html table to get HTML from a page. The html table can parse HTML 4.01 or HTML5 and be used with XPath to extract portions of the HTML page.
Parsing HTML 4.01 Versus HTML5¶
The html table by default parses any HTML page using HTML 4.01 specification, but it can be configured to parse using the HTML5 specification. To parse using HTML5 specification, you assign the compat key the value "html5" as seen in the first example below. When using the HTML5 specification to parse, the returned response may be slightly different.
For example, to get HTML from Yahoo Groups that is parsed according the the HTML5 specification, you would use the following YQL statement containing compat="html5":
select * from html where url="select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5"
The same statement without compat="html" returns HTML parsed according to the HTML 4.01 specification:
select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance"
The html table without XPath returns all of the page’s HTML, which may not be useful in an application. By adding an XPath expression to the statement, you can retrieve specific portions of the HTML page.
The XPath expression in the following statement traverses through the nodes in the HTML page to isolate the latest headlines. In this case, the XPath expression looks first for a li tag with the class hbox groupsSearch-result-entry.
select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]'
The following statement also gets information about Yahoo Groups, but traverses the nodes to get links to the different groups:
select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]/h4/a'
Instead of the the wildcard asterisk (*), you can specify a particular element to process. To get just the content from an HTML page, you can specify content keyword after the word select. A statement with the content keyword processes the HTML in the following order:
- It looks for any element named “content” within the elements found.
- If an element named “content” is not found, the statement looks for an attribute named “content”.
- If neither an element nor attribute named “content” is found, the statement returns the element’s textContent.
For example, the following statement extracts only the HTML links (href tags) for Yahoo Groups:
select href from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]/h4/a'
The following statement, for example, returns the textContent of each anchor (a) tag retrieved by the XPath expression:
select content from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]/h4/a'