On June 2, 2009, I gave a presentation in London with Skills Matter, an open source education group. The presentation, Trainspotting with BOSS (slideshow | podcast), demonstrated how an enthusiast could use BOSS for site search and build a robust web site.
Several concepts were brought up during the conversation that deserve a bit more thought.
- Dynamically populating the sites param.
- Can BOSS deliver the outbound links for a site?
These actually relate to each other. One person's concept was to use the aggregated links from a page to build the sites param. Let's look at them individually.
Dynamic "sites" Argument
Many vertical search engines use the sites argument to refine the results to a selected list of resources. It's the easiest method for building a vertical search engine and can generate great results, as described in Create a niche search engine with Yahoo! BOSS.
The sites argument contains a comma-separated list of domains.
I built my first vertical search engine by building this string manually. My latest site is generating this string via the data I've compiled for each of my resources. This allows me to have a single point of reference, minimizing maintenance and development work.
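Here is a minimal sketch of generating that string from resource data instead of typing it by hand. The resource records, their field names, and the YOUR_APP_ID placeholder are made up for illustration, and the endpoint shape is my recollection of the BOSS web search URL, so verify it against the BOSS documentation before relying on it.

```python
from urllib.parse import quote, urlencode

# Hypothetical resource records; in practice these come from your own data store.
RESOURCES = [
    {"name": "Example Recipes", "domain": "recipes.example.com"},
    {"name": "Example Baking", "domain": "baking.example.org"},
    {"name": "Example Recipes (mirror)", "domain": "recipes.example.com"},
]


def build_sites(resources):
    """Return a comma-separated list of unique domains, preserving order."""
    seen = []
    for resource in resources:
        domain = resource["domain"].strip().lower()
        if domain and domain not in seen:
            seen.append(domain)
    return ",".join(seen)


def build_boss_url(query, sites, appid="YOUR_APP_ID"):
    """Assemble a BOSS web search URL (endpoint shape is an assumption)."""
    base = "http://boss.yahooapis.com/ysearch/web/v1/" + quote(query)
    params = {"appid": appid, "format": "json", "sites": sites}
    return base + "?" + urlencode(params)
```

Calling build_boss_url("apple pie", build_sites(RESOURCES)) gives a URL limited to the two unique domains. Note that urlencode percent-encodes the commas in the sites value.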
The concept could easily be expanded. Why can't we analyze the user's query to dynamically generate a set of resources? Let's look at two methods.
Using Wikipedia for related information
There was a presentation at the WWW2009 conference, Understanding User's Query Intent with Wikipedia, which discussed using Wikipedia to generate related concepts. We can expand this concept for our test site.
Generating related topics
[Image: Delicious apple pie. Photo by bcostin.]
The user searches for "apple pie". Our search engine sends this query to DBpedia (a data project based on Wikipedia) and gets related categories: American pies | British pies | Dutch cuisine | Sweet pies | Apple products | English cuisine. We use these categories to generate our BOSS requests.
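The DBpedia lookup could be sketched roughly as follows. This only builds the request URL and parses a response; it does not make the network call. The way the query is turned into a resource name is a naive assumption of mine, and the dct:subject property and the format parameter reflect how I understand the public DBpedia SPARQL endpoint to work, so treat them as things to check.

```python
import json
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"  # public endpoint, assumed available


def to_resource_name(query):
    """Naive guess at a DBpedia resource name: 'apple pie' -> 'Apple_pie'."""
    return query.strip().capitalize().replace(" ", "_")


def category_query_url(query):
    """Build a request URL asking DBpedia for the categories of one resource."""
    resource = to_resource_name(query)
    sparql = (
        "PREFIX dct: <http://purl.org/dc/terms/> "
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        "SELECT ?label WHERE { "
        f"<http://dbpedia.org/resource/{resource}> dct:subject ?category . "
        "?category rdfs:label ?label . "
        'FILTER (lang(?label) = "en") }'
    )
    params = {"query": sparql, "format": "application/sparql-results+json"}
    return SPARQL_ENDPOINT + "?" + urlencode(params)


def parse_categories(response_text):
    """Extract category labels from a SPARQL JSON results document."""
    data = json.loads(response_text)
    return [row["label"]["value"] for row in data["results"]["bindings"]]
```

Fetching category_query_url("apple pie") and passing the response body to parse_categories would yield the labels we then feed into our BOSS requests.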
Generating the sites attribute
When the user searches for apple pie we will also build BOSS requests for American pies, British pies, Dutch cuisine, and more. It's time to build the list of resources for each of those categories via a resource database.
The resource database will include the websites we've decided are great sources of information. We'll associate each resource with its categories. This allows us to quickly build a unique sites attribute for each request by looking up all the resources filed under apple pie or one of its related categories. DBpedia also includes external links, which can help you start your database.
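As a rough illustration, the resource database can be as simple as a mapping from domain to categories. The domains below are placeholders and the in-memory dictionary stands in for whatever storage you actually use.

```python
# Hypothetical resource database: domain -> categories it covers.
RESOURCE_DB = {
    "pies.example.com": {"American pies", "Sweet pies"},
    "dutchfood.example.org": {"Dutch cuisine"},
    "orchard.example.net": {"Apple products", "Sweet pies"},
}


def sites_for(categories, db=RESOURCE_DB):
    """Return the sites value for every resource tagged with any given category."""
    wanted = set(categories)
    # Sort so the same categories always produce the same string.
    domains = sorted(domain for domain, cats in db.items() if cats & wanted)
    return ",".join(domains)
```

Each related category gets its own sites value this way, so a request for Dutch cuisine only searches the resources filed under that category.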
Now that you've got a unique set of resources for apple pie, as well as the related categories, you can begin working with the results to increase the value of your search results. You could merge the results and re-rank them by the number of times a result is present in the final set. You could also provide the user with related search tabs or side-bar modules.
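The merge-and-re-rank idea might look something like this. It assumes each BOSS response has already been reduced to a list of result URLs, which is a simplification on my part.

```python
from collections import Counter


def merge_and_rerank(result_sets):
    """Merge several result lists; rank URLs by how many lists contain them."""
    counts = Counter()
    first_seen = {}
    for results in result_sets:
        # dict.fromkeys drops duplicates so a URL counts once per list.
        for url in dict.fromkeys(results):
            counts[url] += 1
            first_seen.setdefault(url, len(first_seen))
    # Most frequent first; ties keep the order in which URLs first appeared.
    return sorted(counts, key=lambda url: (-counts[url], first_seen[url]))
```

A URL that shows up for apple pie, American pies, and Sweet pies floats above one that only appears in a single result set.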
Building the "sites" via outbound links
BOSS does not offer external link information in the search results. Grabbing these will require some scraping. Yahoo! Query Language (YQL) will make this a lot easier. The YQL console includes an example for grabbing headline links from the Yahoo! Finance home page.
Let's build a new search engine that harnesses the real-time power of Twitter. Once again the user searches for apple pie. We search Twitter and get the last 100 or so tweets about the subject. We can then scrape the links people are sharing in their tweets, which in theory point to relevant information.
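Assuming the tweets have already been fetched as plain text, pulling out the most shared domains could be sketched like this. The regular expression is deliberately simple, and shortened links would need to be resolved to their destinations first, which this sketch does not attempt.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")


def shared_domains(tweets, limit=10):
    """Return the most frequently linked domains found in the tweet texts."""
    counts = Counter()
    for text in tweets:
        for url in URL_PATTERN.findall(text):
            # Drop trailing punctuation that is part of the sentence, not the URL.
            host = urlparse(url.rstrip(".,;:!?)")).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            if host:
                counts[host] += 1
    return [host for host, _ in counts.most_common(limit)]
```

Joining the returned list with commas gives a sites value built from what people are sharing right now.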
This technique could be more useful when working with more specific search engines and/or more appropriate link aggregators.
Our next search engine is for plants and vegetables. We know there is a very active site that documents horticulture. We do a search of this site and scrape the links from the individual pages to create our sites param. This is especially useful for sites with forums or active comments, such as Botany.com.
This is easier in YQL if the site adds a particular class or rel="nofollow" to its external links, as TechCrunch does.
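For comparison, here is a sketch of the same outbound-link idea using Python's standard html.parser rather than YQL. It assumes you already have the page's HTML as a string; the class and function names are my own.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class OutboundLinkParser(HTMLParser):
    """Collect the hosts of links that point away from own_host."""

    def __init__(self, own_host, nofollow_only=False):
        super().__init__()
        self.own_host = own_host
        self.nofollow_only = nofollow_only
        self.domains = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        host = urlparse(attributes.get("href") or "").netloc.lower()
        rel = (attributes.get("rel") or "").lower().split()
        # Skip relative links and links to the site itself or its subdomains.
        if not host or host == self.own_host or host.endswith("." + self.own_host):
            return
        if self.nofollow_only and "nofollow" not in rel:
            return
        if host not in self.domains:
            self.domains.append(host)


def outbound_sites(html, own_host, nofollow_only=False):
    """Return a comma-separated sites value built from a page's outbound links."""
    parser = OutboundLinkParser(own_host, nofollow_only)
    parser.feed(html)
    return ",".join(parser.domains)
```

Setting nofollow_only=True mirrors the shortcut above: on sites that mark user-submitted links with rel="nofollow", it narrows the list to the links the community has shared.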
Take BOSS one step further
BOSS is an extremely flexible API. Start thinking about how you can pull the right content from the Yahoo! Search index for your site.