XPath fetch page charset problem - non-latin ISO characters showing as "diamonds"


This is my pipe: http://pipes.yahoo.com/pipes/pipe.edit?_id=2dbadaff7eb8fca88fa750dde42b350b

A couple of weeks ago, the XPath Fetch Page module started producing "diamonds" instead of the actual letters (the page is encoded as iso-8859-7).

Does anybody know if this can be solved?


16 Replies
  • It would seem that something has changed, whether it be the web page or Pipes. Anyway, as an alternative to the XPath Fetch Page module, you can use a YQL module with the query:

    select * from html where url='http://www.esiea.gr/gr/main.html' and xpath='//blockquote/p/strong/font/a' and charset='iso-8859-7' and compat='html5'
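
A likely explanation for the "diamonds", sketched in Python rather than in Pipes itself: when bytes encoded as ISO-8859-7 are decoded as UTF-8, the invalid sequences are replaced with U+FFFD, which browsers usually draw as a diamond with a question mark. The sample word and the variable names are illustrative assumptions, not taken from the page in question.

```python
# Sample Greek text; an assumption used only for illustration.
original = "Αθήνα"

# The bytes a server using ISO-8859-7 would send.
raw = original.encode("iso-8859-7")

# Decoding with the wrong charset: invalid UTF-8 bytes become U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the declared charset, which is what charset='iso-8859-7'
# asks YQL to do.
right = raw.decode("iso-8859-7")

print(wrong)
print(right)
```

Declaring the charset explicitly removes the guesswork, which is presumably why the YQL query above works where the XPath Fetch Page module does not.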

  • Thank you very much. Worked like a charm!

  • Any idea on how I can use YQL in a loop in order to correctly decode the body as well?

  • OK, fixed decoding. Only thing missing: How can I simulate the "emit items as string" when using YQL?

  • Hmmm take a look at this pipe:


    and it uses this source: http://paul.donnelly.org/yql/string.xml

    You may have to change it to accommodate the charset param. If you get stuck, I'll see if I can update my example by tomorrow.

    http://webcache.googleusercontent.com/search?q=cache:Ov7wTTlL9nUJ:discuss.pipes.yahoo.com/Message_Boards_for_Pipes/threadview%3Fm%3Dte%26bn%3Dpip-DeveloperHelp%26tof%3D2%26rt%3D2%26frt%3D2%26dir%3Df%26ri%3D13475%26t%3Dc&cd=1&hl=en&ct=clnk&gl=us

    is a link to a similar discussion.

    Thanks -paul

  • Yeap, that's it!

    I just added two declarations:

    <key id="charset" type="xs:string" paramType="variable" required="true"/>
    <key id="compat" type="xs:string" paramType="variable" required="true"/>

    Modified the query to this:

    var q = y.query('select * from html where url=@url and xpath=@xpath and charset=@charset and compat=@compat',{url:url,xpath:xpath,charset:charset,compat:compat});

    And I added two text inputs to the pipe accordingly, so that the user can adjust charset and compat.

    Again: Thanks a lot to all of you who answered!
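
For readers who want to try the same four parameters outside Pipes, here is a minimal Python sketch that composes the request URL without sending it. The endpoint address and the helper name `build_yql_url` are assumptions for illustration, and the YQL service may no longer be reachable.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed public YQL REST endpoint; not taken from this thread.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(url, xpath, charset, compat="html5"):
    """Compose a YQL request URL for the html table (hypothetical helper)."""
    # Values are quoted naively; a value containing a single quote
    # would need escaping before being placed in the statement.
    statement = (
        "select * from html "
        f"where url='{url}' and xpath='{xpath}' "
        f"and charset='{charset}' and compat='{compat}'"
    )
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": "xml"})


request_url = build_yql_url(
    "http://www.esiea.gr/gr/main.html",
    "//blockquote/p/strong/font/a",
    "iso-8859-7",
)

# Round-trip the query string to show the statement that would be sent.
sent_statement = parse_qs(urlsplit(request_url).query)["q"][0]
print(sent_statement)
```

Keeping charset and compat as arguments mirrors the two extra text inputs described above.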

  • Hello!

    I'm trying to build an RSS feed from a webpage: http://www.gameblog.fr/actualite/jeux-independants/

    I've already found very interesting tools in this page to build that: http://pipes.yahoo.com/pipes/pipe.edit?_id=330e1b9eb39f23aedbc72636a0af4ba5

    But I have an additional question:

    With the YQL module I get the wrong URLs, whereas with XPath Fetch Page I get the correct ones. (I wanted to use YQL to avoid encoding problems... I get the same "diamonds" as Sotos F.)

    Is there a way to get the text and url right?

    Thanks in advance,


  • Keep using the YQL module and add a rule to your Regex module: in [item.link], replace [^] with [http://www.gameblog.fr]. Omit all []s. Do the same for the image source.
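
The rule works because `^` matches the empty position at the start of the string, so the replacement text is inserted in front of the existing link. Here is a small Python sketch of the same idea, outside Pipes. The function name `absolutize` and the sample relative path are assumptions for illustration.

```python
import re

BASE = "http://www.gameblog.fr"


def absolutize(link):
    """Prefix a site-relative link with the host, mirroring the ^ rule."""
    # ^ matches the start of the string without consuming any characters,
    # so BASE is inserted before the existing text.
    return re.sub(r"^", BASE, link)


# Assumed example of a relative link as it might appear in an item.
print(absolutize("/actualite/jeux-independants/"))
```

Note that this prefixes every link unconditionally, so it is only suitable when all links in the feed are site-relative.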

  • Thank you! It works very well:

  • I have a very naive question... How can I find out the meaning of the symbols I have to use in the different modules, like Regex, String Builder, etc.? Is there some "command list" available somewhere?

    Thanks! Julien

  • If I understand you correctly, there are only 3 modules that use "symbols". For the Regex and String Regex modules you could check out http://www.regular-expressions.info/tutorial.html. For the Xpath module you will have to do a web search as I have no recommended site as yet.

  • Thank you, this tutorial is very useful. That is exactly what I was looking for. For the XPath module, I have already found various sources that give me the basics I need to know (without details that I don't really need). The String Builder is still enigmatic to me... For example, I found somewhere that for a "carriage return" I have to write:
    . (http://pipes.yahoo.com/pipes/pipe.edit?_id=330e1b9eb39f23aedbc72636a0af4ba5) Is that very common? I'm not used to programming... Does a list of such basic symbols exist?

  • Sorry... In my last post I wrote the word 'line' between two <X> tags (with X=br), and it produced the carriage return in the post itself... That must be related to the topic of the post... Is there some basic background I have missed?

  • The < b r > (no spaces) is just a standard HTML tag that represents a line break.

  • Thanks, YQL worked for me for Turkish characters, too. I used your first advice.

  • I used YQL in a loop in order to correctly decode the body as well. Thanks, YQL worked for me for German characters, too. I used your first advice.

