0

disappointed with the syntax

I just watched Jonathan's presentation at http://jonathantrevor.net/?p=15. As someone who is very familiar with SQL, I don't understand why YQL was made to look like SQL in some ways but not others. Why have a SELECT statement but instead of an ORDER BY clause, you have to post-process with a SORT function? And instead of SELECT DISTINCT, why post-process with a UNIQUE function?

The choice of an IN sub-query being the only join operation is another strange decision. Joining together 5 tables will require writing very messy, unmanageable code (assuming multiple joins are allowed in a single SELECT statement). INNER JOIN would be a much more elegant and maintainable way of doing this.

Why reinvent the wheel when you could have modeled it on ANSI SQL which has been around for over 20 years? I am guessing the answer is it was quicker and easier to reuse Yahoo's pipe functionality but taking the quick approach is not going to attract serious developers to your technology.

Michael Frantz

by
  • M
  • Oct 27, 2008
4 Replies
  • QUOTE (frantzmff @ Oct 27 2008, 08:58 AM) <{POST_SNAPBACK}>
    I just watched Jonathan's presentation at http://jonathantrevor.net/?p=15. As someone who is very familiar with SQL, I don't understand why YQL was made to look like SQL in some ways but not others. Why have a SELECT statement but instead of an ORDER BY clause, you have to post-process with a SORT function? And instead of SELECT DISTINCT, why post-process with a UNIQUE function?

    The choice of an IN sub-query being the only join operation is another strange decision. Joining together 5 tables will require writing very messy, unmanageable code (assuming multiple joins are allowed in a single SELECT statement). INNER JOIN would be a much more elegant and maintainable way of doing this.

    Why reinvent the wheel when you could have modeled it on ANSI SQL which has been around for over 20 years? I am guessing the answer is it was quicker and easier to reuse Yahoo's pipe functionality but taking the quick approach is not going to attract serious developers to your technology.

    Michael Frantz


    Michael,

    Thanks for the post. First, this is our initial launch of YQL. We're certainly not "finished" and we're continuing to work on new features that we feel are missing. Feedback and questions will help us focus on what our developers think are missing. So while I think we've got a core set of functionality now which is incredibly useful I know there are many things that it doesn't do that would make it more useful.

    To address a couple of your points. For many tables YQL does not have access or control of the "full" database (or rather full set of XML documents), and relies on the web service interfaces provided by the sources. Where these sources do not support ordering or grouping its not straightforward to execute entire set operations like these. The POST query functions operate ONLY on the results return by that query. Rather than mixing entire table-wide operations with ones that could only work on the returned (finite) set, we separated these out to POST query functions. Therefore SELECT DISTINCT is not the same as SELECT | UNIQUE(field) except in the case where the SELECT happened to work over the *entire* remote data. So we didn't attempt to overload the meaning from ANSI SQL with something that doesn't really do the same thing. We want to support these types of SQL operations when we can execute them correctly, which we may be able to do depending on the data source.

    Using IN rather than left or inner joins is another example of a starting point, but not the final solution. One of the interesting things when dealing with hierarchical data is how we should handle the cross-product between the two sets of documents should you choose to select partially from both and retain a valid XML document. Rather than start with that, we choose to use IN-selects only. Its clearly "possible to get there from here" (we could, for example, only allow selection on a single table from multiple left joins) and something we're looking at as we go forward - after all you can re-write left joins to in-selects and v.v. It would definitely be more elegant for the type of joins you're talking about.

    Thanks again for the feedback. There was no (existing) Yahoo pipes technology used in YQL, but we are the same team and we brought ideas and concepts across where we felt they made sense. For the most part it wasn't "quicker and easier" but decisions based on some of the issues I talk about above that we believe gives developers something that's predicable, consistent, easy to use and powerful - as well a starting point for further work (both adding additional core SQL or other functionality).

    Jonathan
    0
    • M
    • Oct 28, 2008
    Jonathan,

    Thanks for the thoughtful reply. Sorry for the slightly harsh tone of my post. It is clear that you are giving this a lot of thought and this is the starting point for YQL rather than the finished product.

    I created some very simple examples in a mySQL database in order to reply to the points you raised. The examples use two tables, employee and project:

    CREATE TABLE `employee` (
    `name` varchar(100) NOT NULL,
    `projectname` varchar(100) NOT NULL);

    INSERT INTO `employee` (`name`, `projectname`) VALUES
    ('Bob', 'Web'),
    ('Bob', 'Support'),
    ('Jane', 'Database'),
    ('Tim', 'Support'),
    ('Tim', 'Sales'),
    ('Rudy', 'Odd Project');

    CREATE TABLE `project` (
    `projectname` varchar(100) NOT NULL,
    `location` varchar(100) NOT NULL);

    INSERT INTO `project` (`projectname`, `location`) VALUES
    ('Web', 'New York'),
    ('Database', 'New York'),
    ('Support', 'London'),
    ('Sales', 'London');



    The POST query functions operate ONLY on the results return by that query. Rather than mixing entire table-wide operations with ones that could only work on the returned (finite) set, we separated these out to POST query functions.

    To my way of thinking, DISTINCT is closer to a post-query operation rather than a table-wide operation. Using my sample tables, if I wanted the list of employees that have projects in New York, I would write this ANSI SQL query:

    SELECT
    DISTINCT name
    FROM
    employee e
    INNER JOIN
    project p
    ON p.projectname = e.projectname
    WHERE
    location = 'New York'


    Which would return this result:

    Bob
    Jane


    We cannot get the unique names in the employee table until after we have joined to the project table. If we got the unique names in the employee table before the join, we would lose their projects and couldn’t filter on location.

    So this is an incredibly long-winded way of saying, I think DISTINCT is like a post-query operation and don’t really see the need for introducing UNIQUE(field) (or SORT(field) for that matter).

    I understand that in YQL, you can limit the number of rows returned from each table, e.g. local.search(1000), which is a divergence from ANSI SQL. But in my view, this doesn’t necessitate the additional divergence of introducing UNIQUE(field) and SORT(field).


    ...after all you can re-write left joins to in-selects and v.v.

    I’m not following you on this. If I write this LEFT OUTER JOIN query against my tables:


    SELECT
    name
    FROM
    employee e
    LEFT OUTER JOIN
    project p
    ON p.projectname = e.projectname


    It returns these results including Rudy:

    Bob
    Bob
    Jane
    Tim
    Tim
    Rudy




    Now rewriting this using an IN sub-query:

    SELECT
    name
    FROM
    employee
    WHERE
    projectname IN (SELECT projectname FROM project)


    I got the same results except Rudy is missing:

    Bob
    Bob
    Jane
    Tim
    Tim


    My point is that I don’t see LEFT OUTER JOINs and IN sub-queries as being equivalent. I see IN sub-queries filtering data in an equivalent way to an INNER JOIN.

    The other odd thing about making an IN sub-query the only way of joining tables in YQL is that it doesn’t give me a way of returning columns from BOTH tables. For example, how could I write this query in YQL where I want both the name and location?

    SELECT
    name, location
    FROM
    employee e
    INNER JOIN
    project p
    ON p.projectname = e.projectname



    Stepping back for a moment and looking at the big picture, creating a new database engine from scratch that supports the cases that we are discussing is a daunting task that requires a sizable team and many years. Have you looked into extending an existing open source platform like mySQL? In other words, leveraging their code base which has already solved these problems but introducing new system tables, like social.profile and social.connections, which access Yahoo! / internet data.

    Regards,
    Michael Frantz
    0
  • Frank,

    Im just going to focus on a small part of your post to clarify something which I dont think I was clear enough on earlier.

    QUOTE (frantzmff @ Oct 28 2008, 03:12 AM) <{POST_SNAPBACK}>

    INSERT INTO `employee` (`name`, `projectname`) VALUES
    ('Bob', 'Web'),
    ('Bob', 'Support'),
    ('Jane', 'Database'),
    ('Tim', 'Support'),
    ('Tim', 'Sales'),
    ('Rudy', 'Odd Project');


    The POST query functions operate ONLY on the results return by that query. Rather than mixing entire table-wide operations with ones that could only work on the returned (finite) set, we separated these out to POST query functions.

    To my way of thinking, DISTINCT is closer to a post-query operation rather than a table-wide operation. Using my sample tables, if I wanted the list of employees that have projects in New York, I would write this ANSI SQL query:


    Lets go to an even simpler query:

    SELECT
    DISTINCT name
    FROM
    employee e

    So the db, or YQL, needs to go through every employee and collapse the set down to those with unique names. This is great for the example data above, which is small, where YQL could easily fetch it all from the remote data provider with a few requests. Now lets suppose this data is behind a web service (as is most in YQL) and that there are millions of entries (as there are in certain tables), and that service doesn't expose any "grouping" or "distinct" feature as part of its web service interface (which is the norm). If I wanted get the correct result we'd have to fetch every single item from the remote database (as the very last item could be distinct). This obviously is problematic.

    So rather than add "distinct" into the language, which most data sources/tables don't support, we chose to add a set of piped POST-select functions to work on the set of data that we would be returning anyway. Things like sort and unique are locally correct within the results, but not globally correct to the source dataset. Any table wide operations (sort etc) have this problem. I guess that if you understand that you are dealing with a limited "view" or subset of the underlying data set to start with (somewhat arbitrarily chosen based on the source providing the "rows"), local.search(1000) for example, then sort and distinct for the subset would do what you'd expect. We actually wanted to avoid this to start with as we believe will lead to a lot of unexpected results. Definitely food for thought as we improve our REST-service table bindings, or bindings to other data sources.

    Jonathan
    0
    • M
    • Oct 28, 2008
    Jonathan,

    Yes, I understand completely. I read the yql_openhackday.pdf "Changing the table size" section before I posted. I know that local.search(1000) gives me an arbitrary 1000 rows from the data source and local.search(0) returns all rows from the data source.

    Let’s change the simple example to include a YQL table size filter:

    SELECT
    DISTINCT name
    FROM
    Employee(1000)


    I am comfortable with this because to me DISTINCT simply means no duplicate values will be returned in the results. It does not mean every single row in the data source has to be examined.

    I can write the identical mySQL query like this:

    SELECT
    DISTINCT name
    FROM
    Employee
    LIMIT 1000


    See the mySQL documentation: http://dev.mysql.com/doc/refman/5.0/en/lim...timization.html

    “When combining LIMIT row_count with DISTINCT, MySQL stops as soon as it finds row_count unique rows.”

    So mySQL does not have to scan every single row in the data source. It just returns an arbitrary 1000 unique values.

    Mike Frantz
    0

Recent Posts

in YQL