IMDb top films

I'm a total newbie to Pipes, so apologies, but I'm looking for some guidance on where to start with the following.

I would like to capture the results below into a table: http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=1&title_type=feature&year=2012

Could someone suggest some pipe elements I can use to pull this apart and output into a table?

6 Replies
  • Hello,

    First, consider your options. You probably have three: scraping the webpage as you propose, getting the data from a feed offered by the provider, or getting the data from an API. It seems that IMDb does not provide an official API, but you can use an unofficial one to cut down the work. Of course, the problem with any API is that the format of the returned data might change with little or no notice.

    Back to your problem: scraping the webpage. Pipes has two dedicated tools: the XPath Fetch Page module (let's call it XFP) and the Fetch Page module, which is deprecated but still available, as XFP has some limitations for the time being.

    Knowing some XPath basics is a requirement for using XFP; check this link and this one for a syntax reference. In both cases, I would advise the following approach:

    • First, open the source code of the webpage (Ctrl+U in Firefox).

    • Next, identify the data you would like to fetch and its surrounding markup, especially distinctive features: whether the data is in a list (li) or in a table, and the identifier (id="...") of the container element.

    • Then, play with Pipes! In XFP you should probably begin with no XPath expression, then unfold the output until you find your data (if you have read the webpage source, it should be pretty straightforward). Use the unfolded path as the XPath in the module, and ta-da! There you go.

      In the Fetch Page module, you first need to cut out the relevant part of the source code: find distinctive strings just before and just after your data, put those strings in the module, and observe the output. Adjust, observe again, and then fill in the 'split delimiter' if needed.
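    The cut-then-split idea behind the Fetch Page module can be tried outside Pipes first. The following is only a rough Python sketch of that idea, not how the module is actually implemented: the function name `cut_and_split`, its parameters, and the sample markup are all made up for illustration.

```python
def cut_and_split(html, cut_from, cut_to, delimiter):
    """Keep the text between two marker strings, then split it into items.

    Hypothetical helper that mimics the 'cut from / to' and
    'split delimiter' fields described above.
    """
    start = html.find(cut_from)
    if start == -1:
        return []  # the opening marker was not found
    start += len(cut_from)
    end = html.find(cut_to, start)
    if end == -1:
        end = len(html)  # no closing marker: keep everything after the start
    section = html[start:end]
    # Split on the delimiter and drop pieces that are empty after trimming.
    return [piece.strip() for piece in section.split(delimiter) if piece.strip()]


# Invented sample page, only loosely shaped like a results table.
sample = (
    "<html><body><h1>Top films</h1>"
    '<table class="results">'
    "<tr><td>Film A</td></tr>"
    "<tr><td>Film B</td></tr>"
    "</table><p>footer</p></body></html>"
)

items = cut_and_split(sample, '<table class="results">', "</table>", "<tr>")
for item in items:
    print(item)
```

    Each printed piece still contains leftover markup, which is why this approach usually needs more clean-up afterwards than a precise XPath expression does.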

    If I'm not clear enough, type scrap in the pipe search field and start from a scraping pipe as an example. It should be easy that way! Or come back and ask, obviously ;)

    lolo

  • Thanks for the helpful reply.

    I have examined the source code and know that the values reside in the results table (table.results).

    html.scriptsOn body#styleguide-v2.fixed div'wrapper div'root div'pagecontent div#content-2-wide div#main table.results tr.even.detailed td.number

    When I use XFP, I get various results, but they all contain "null". They do show 50 items, which matches the website.

    //* = Returns everything
    //tr = null
    //table[@class='results'] = null
    //tr[@class='even.detailed'] = null
    //tbody/tr[@class='even.detailed'] = no results
    //tr[@*] = null

    I am wondering if I need to further adjust my XFP settings or now add another Pipes module to convert the return into something more meaningful?
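    One detail worth checking in the expressions above: `tr.even.detailed` is CSS-style notation for a row carrying the two classes `even` and `detailed`, whereas an XPath test such as `@class='even.detailed'` compares against that literal string, dot included. The sketch below illustrates the difference with Python's standard `xml.etree.ElementTree`; the sample rows are invented, and it assumes the real attribute reads `class="even detailed"`, which should be confirmed in the page source.

```python
import xml.etree.ElementTree as ET

# Invented sample; assumes the class attribute is written "even detailed".
sample = """
<table class="results">
  <tr class="even detailed"><td class="number">1.</td></tr>
  <tr class="odd detailed"><td class="number">2.</td></tr>
</table>
"""
root = ET.fromstring(sample)

# The dotted form is compared as a literal string, so nothing matches.
dotted = root.findall(".//tr[@class='even.detailed']")
# The space-separated form equals the attribute value exactly.
spaced = root.findall(".//tr[@class='even detailed']")

print(len(dotted), len(spaced))
```

    This only explains an empty match on the class test; it does not by itself account for the "null" values shown by the debugger.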

  • You'll have to re-write your post... This forum has advanced formatting options, but they can be a little overwhelming sometimes. Use the ` symbol to quote HTML/XML code.

    Last but by no means least, post a link to your pipe!

  • Thanks for your help.

    Update:

    I managed to get it to work, although it could do with some tweaking. I worked out that the debugger is extremely useful for analysing the query: I was not drilling down far enough to return the desired values.

    Published Pipe now available: http://pipes.yahoo.com/dm81/imdb50


    Previous (re)Post:

    Uh, good point. I should have previewed before posting.

    The values I wish to extract are held here - within table.results:

    html.scriptsOn 
    body#styleguide-v2.fixed 
    div'wrapper 
    div'root 
    div'pagecontent 
    div#content-2-wide 
    div#main 
    table.results 
    tr.even.detailed
    td.number
    

    Using the "Pipe Output" debugger the following xpath brings back 50 results. The 50 I require although in various .

    //table[@class='results']/tr/td/a[@title]
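    To see what an expression of this shape selects, it can be tried on a small hand-written sample. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports this subset of XPath when the path is written relative to the root (`.//`). The sample rows, titles, and URLs are invented and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample loosely shaped like the results table.
sample = """
<html><body>
<table class="results">
  <tr class="even detailed">
    <td class="number">1.</td>
    <td class="image">
      <a href="/title/tt0000001/" title="Example Film One (2012)">
        <img src="http://example.com/one.jpg"/>
      </a>
    </td>
    <td class="title"><a href="/title/tt0000001/">Example Film One</a></td>
  </tr>
  <tr class="odd detailed">
    <td class="number">2.</td>
    <td class="image">
      <a href="/title/tt0000002/" title="Example Film Two (2012)">
        <img src="http://example.com/two.jpg"/>
      </a>
    </td>
    <td class="title"><a href="/title/tt0000002/">Example Film Two</a></td>
  </tr>
</table>
</body></html>
"""
root = ET.fromstring(sample)

# Same expression as in the post, made relative to the document root.
anchors = root.findall(".//table[@class='results']/tr/td/a[@title]")

items = []
for a in anchors:
    img = a.find("img")  # nested thumbnail, if the anchor has one
    items.append({
        "title": a.get("title"),
        "href": a.get("href"),
        "img": {"src": img.get("src")} if img is not None else None,
    })

for item in items:
    print(item)
```

    Only the anchors that carry a title attribute are selected, so each row contributes one item here, with the link and the nested image travelling along with the title.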
    

    I am wondering if I need to further adjust my XFP settings or now add another Pipes module to convert the return into something more meaningful?

  • Looks like my post disappeared, so here is a re-post:

    Unpublish (or do not publish) pipes that are still in debug. You might think yours is no longer in debug, but read the entire post before stating that ;)

    OK, so it looks like your pipe works, and it's better to use an on-point XPath expression than an overly simplistic one followed by a lot of operations, so good job.

    Now, the feed outputs only the titles of the movies. However, you fetch more than that: you also get thumbnails and IMDb links, so use them!

    thumbnail (replace [colon] by : else my post can't be ... posted): rename item.img to media[colon]thumbnail and media[colon]thumbnail.src to media[colon]thumbnail.url.

    link: rename item.href to link.

    You might also want to copy the link to y:id so that your elements have a guid (some feed readers need this), and populate the description of your items, for example with a synopsis and/or ratings. You can do that either by using the title in a query-like URL to an API or a service, or by scraping the film's webpage.

    I have one pipe which takes the first approach, with the RottenTomatoes API, and a second pipe which takes the second approach, aka scraping-handle-scraping, with my library's website, if you want examples!
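    The effect of these renames and of the y:id copy can be pictured on a single item. This is only a Python sketch over plain dictionaries, not Pipes itself: the function name `rename_item` and the sample item are hypothetical, and the field names simply follow the ones mentioned above.

```python
def rename_item(item):
    """Apply the renames described above to one item (hypothetical helper)."""
    out = dict(item)  # work on a copy, leave the input untouched

    # item.img -> media:thumbnail, then media:thumbnail.src -> media:thumbnail.url
    img = out.pop("img", None)
    if img is not None:
        thumbnail = dict(img)
        if "src" in thumbnail:
            thumbnail["url"] = thumbnail.pop("src")
        out["media:thumbnail"] = thumbnail

    # item.href -> link, and copy the link to y:id so the item has a guid
    href = out.pop("href", None)
    if href is not None:
        out["link"] = href
        out["y:id"] = href

    return out


# Invented sample item shaped like one fetched anchor.
fetched = {
    "title": "Example Film One (2012)",
    "href": "http://www.imdb.com/title/tt0000001/",
    "img": {"src": "http://example.com/one.jpg"},
}

result = rename_item(fetched)
print(result)
```

    In Pipes itself, the same result would come from the Rename module, with one rename mapping per field plus one copy mapping for y:id.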

    enjoy playing with pipes ;)

  • Thanks. Some good tips there. However, I cannot get the y:id part to work.

    Could you explain, or show an example of, how the record number is returned or how a unique ID is assigned to each record?
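    As a rough illustration of what a unique ID per record could look like, the Python sketch below derives one from each item's link and falls back to the record's position in the list. It is not a Pipes recipe: the helper `assign_ids`, the `position` key, and the sample links are hypothetical, and it assumes the links contain a `/title/tt…/` segment, which should be checked against the actual feed.

```python
import re

# Assumption: each film link contains a segment such as /title/tt0000001/.
TITLE_ID = re.compile(r"/title/(tt\d+)")


def assign_ids(items):
    """Give every item a y:id taken from its link, or from its position."""
    out = []
    for position, item in enumerate(items, start=1):
        match = TITLE_ID.search(item.get("link", ""))
        # Prefer the ID found in the link; otherwise use the record number.
        unique_id = match.group(1) if match else f"item-{position}"
        out.append({**item, "y:id": unique_id, "position": position})
    return out


# Invented sample items.
records = assign_ids([
    {"title": "Example Film One", "link": "http://www.imdb.com/title/tt0000001/"},
    {"title": "Example Film Two", "link": ""},
])

for record in records:
    print(record["position"], record["y:id"], record["title"])
```

    Using something taken from the link rather than the bare position keeps the ID stable when the ranking changes, which is usually what feed readers expect from a guid.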
