how to use regex to return whole words only

Hi

my pipe is here: http://pipes.yahoo.com/pipes/pipe.info?_id=98949320c98308d1f7d1359bd60f68eb

I'm returning a simple RSS feed but ideally would like to return just the first 30 words from the description field.

Initially in my pipe I've used a loop and a sub-string to return the first 200 characters, but this sometimes cuts the text off mid-word, showing a partial word at the end, which is unclean.

I've heard that regex can specifically match whole words, but would like an example of how to do this if anyone has one?

thanks

Chris

1 Reply
  • hi,

    you have a few options. First, after seeing your feed, you might want to strip the description of all HTML tags first. For all (stripping HTML tags and the options discussed below), you'll need a regex module. A regex module basically allows you to replace a selected string by another one, with advanced possibilities to select that first string. For stripping HTML tags you may use:

    "In [item.description] replace \</?[^\>]+\> with [ ] g[x] s[x] m[ ] i[ ]", which means "select a string which begins with a '<', possibly followed by a '/', contains at least one character that is not a '>', and ends with a '>'; replace that string with '' (nothing)". The 'g' option (for 'global') tells the module to repeat the operation for every match found in the string.
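For anyone who wants to try the tag-stripping pattern outside of Pipes, here is a minimal sketch in Python. It assumes Python's `re` module as a stand-in for the Pipes regex module, and the function name `strip_tags` is made up for illustration. The backslashes before '<' and '>' are dropped because Python does not need them.

```python
import re

# Same idea as the Pipes rule: '<', an optional '/', one or more
# characters that are not '>', then '>'.
TAG_PATTERN = re.compile(r"</?[^>]+>")

def strip_tags(description):
    """Remove every HTML-like tag (re.sub replaces all matches, like 'g')."""
    return TAG_PATTERN.sub("", description)

print(strip_tags("<p>Hello <b>world</b></p>"))  # Hello world
```

This is only a rough cleanup for simple feed descriptions; it is not a real HTML parser.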

    For cutting your description, you'll want to select all of it while remembering only the first n characters (or words), i.e. save one (or several) sub-string(s) while matching the whole string, and then substitute the saved sub-string for the whole selection:

    "In [item.description] replace [a regex, see below] with [$1] g[ ] s[x] m[ ] i[ ]" ($1 means "first saved string"; tick the 's' option, 'single line' mode, so that '.' also matches line breaks and the trailing .*$ can consume the rest of the description). Your options are:

    • for returning 30 words as you asked, you may use a regex of the form: ^((?:\b.+?\b){30}).*$ which would translate as "from start of string, count 30 'words' (of at least 1 character; a run of spaces or punctuation between two words is also counted as a 'word') and save that sub-string; select all that is left in string".
    • for returning 30 words again, not counting spaces and signs, you may use a regex of the form: ^((?:\b.+?\b.+?\b){30}).*$ which would translate as "from start of string, count 30 'words' (of at least 1 character, each taken together with the spaces or signs that follow it) and save that sub-string; select all that is left in string". However, this regex is computationally intensive, and will limit the number of results.
    • for returning 200 characters, but not cutting the last word, you can simply use: ^(.{200}.*?\b).*$ which translates as "from start of string, count 200 characters, continue until you find a word boundary, and save that sub-string; select all that is left in string".
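The three options above can be tried out with a short Python sketch. It assumes Python's `re` module behaves closely enough to the Pipes regex module for these patterns; `\1` plays the role of `$1`, `re.DOTALL` plays the role of the 's' checkbox, and the function names and the small counts used in the examples are made up for illustration.

```python
import re

def first_tokens(text, n):
    """Option 1: keep n boundary-to-boundary tokens (separators count too)."""
    pattern = r"^((?:\b.+?\b){%d}).*$" % n
    return re.sub(pattern, r"\1", text, count=1, flags=re.DOTALL)

def first_words(text, n):
    """Option 2: keep n words, each together with the separator after it."""
    pattern = r"^((?:\b.+?\b.+?\b){%d}).*$" % n
    return re.sub(pattern, r"\1", text, count=1, flags=re.DOTALL)

def first_chars_whole_word(text, n):
    """Option 3: keep n characters, then continue to the next word boundary."""
    pattern = r"^(.{%d}.*?\b).*$" % n
    return re.sub(pattern, r"\1", text, count=1, flags=re.DOTALL)

sample = "The quick brown fox jumps over the lazy dog."
print(repr(first_tokens(sample, 5)))            # 'The quick brown'
print(repr(first_words(sample, 3)))             # 'The quick brown '
print(repr(first_chars_whole_word(sample, 7)))  # 'The quick'
```

Note that when the text is shorter than the requested count, the pattern does not match at all and the text comes back unchanged. The word-counting patterns also expect the text to start with a word character, so leading spaces or punctuation should be trimmed first.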

    There, you now have everything you need!
