Help me with a script

by banel
11 replies
Hi guys, I need a script that can extract the ires div from Google's results page. I have this:

Code:
$url = 'hxxp://google.com/search?q=site:warriorforum.com&hl=en&num=100';
$m= file_get_contents ($url);
$specific_div = 'ires';
preg_match_all('#<ol\s*(?:id|class)\s*=\s*"'.preg_quote($specific_div).'">(.+?)</ol>#is', $m, $match);
print implode("<br>",$match[1]);
I used this a while ago with another script. I tried to make a scraper, but the code above does not seem to work.

Help me please.
#script
  • Profile picture of the author nmarley
    Why not first download the web page and put the HTML into a file?

    Then write a separate script to parse that file and extract the data you want.

    This will accomplish 2 things:

    1) Your development goes a lot faster, since you don't have to hit google every time you make changes to your regex.

    2) You can actually see the data which you're searching through and change your regex accordingly (narrowing it down until you get what you need, etc).
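    A minimal sketch of that two-step workflow, assuming a hypothetical local file name (results.html) and two made-up helper names (save_page, match_in_file); none of these come from the original script:

    ```php
    <?php
    // Step 1 (run once): save the page to a local file.
    function save_page(string $url, string $path): bool
    {
        $html = @file_get_contents($url);   // false on failure
        if ($html === false) {
            return false;
        }
        return file_put_contents($path, $html) !== false;
    }

    // Step 2 (run as often as needed): try a pattern against the saved copy.
    function match_in_file(string $path, string $pattern): array
    {
        $html = @file_get_contents($path);
        if ($html === false) {
            return [];
        }
        $count = preg_match_all($pattern, $html, $matches);
        return $count ? $matches[0] : [];   // full matches only
    }
    ```

    Call save_page($url, 'results.html') once, then keep adjusting the pattern passed to match_in_file('results.html', ...) without requesting the page again.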
  • Profile picture of the author badwolf
    Are you sure the regexp is working? Why not give us a sample of the HTML so we can check?
    • Profile picture of the author banel
      Hi, thanks to both of you for the replies.

      @nmarley, @badwolf: That is what I want to do, put the HTML into a file. I posted the actual link above so you can see the source; my HTML source file is the source of that page.

      I'm just having problems with the regex.
  • Profile picture of the author Arbitbet
    $url = 'hxxp://google.com/search?q=site:warriorforum.com&hl=en&num=100';
    $m= file_get_contents ($url);
    preg_match('/<div id=ires><ol>.*<\/ol><\/div>/Usi', $m, $matches);
    print $matches[0];

    Please check.
  • Profile picture of the author banel
    @Arbitbet: Thank you so much! Can you recommend a site where I can learn more about preg_match? I want to be able to write my own scripts. As you can see, I only need simple tasks, and I don't want to come back here every time I have a problem.
    Thanks again, man.
  • Profile picture of the author banel
    @Arbitbet: One more question: what if I want to get the h3 class="r" elements? I tried this, but it doesn't work.
    preg_match('/<div id=ires><ol><li class=g><h3 class=r>.*<\/h3><\/li><\/ol><\/div>/Usi', $m, $matches);
    • Profile picture of the author Arbitbet
      Originally Posted by banel View Post

      @Arbitbet: One more question: what if I want to get the h3 class="r" elements? I tried this, but it doesn't work.
      preg_match('/<div id=ires><ol><li class=g><h3 class=r>.*<\/h3><\/li><\/ol><\/div>/Usi', $m, $matches);
      This is wrong.

      I would do it like this:
      preg_match('/<div id=ires><ol>.*<\/ol><\/div>/Usi', $m, $matches); // grab the ires div
      // before the next step, use var_dump($matches) to check what the previous call returned
      preg_match_all('/<h3 class="r">.*<\/h3>/Usi', $matches[0], $temp); // grab the h3 tags inside the ires div; preg_match_all because the div contains many h3 tags
      var_dump($temp[0]); // inspect the result

      Divide and conquer!

      Another way: you can use the DOM extension, see hxxp://php.net/manual/en/book.dom.php.

      For regular expressions, see hxxp://php.net/manual/en/book.pcre.php or the book "Mastering Regular Expressions" by Jeffrey Friedl.
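      A rough sketch of the DOM route mentioned above, using PHP's bundled DOMDocument and DOMXPath classes. The function name ires_titles is made up for illustration, and the sketch assumes the results still sit in an element with id "ires" and use h3 tags whose class is exactly "r", as in this thread:

      ```php
      <?php
      // Returns the text of every <h3 class="r"> inside the element with id="ires".
      function ires_titles(string $html): array
      {
          if ($html === '') {
              return [];                              // loadHTML rejects an empty string
          }
          $prev = libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
          $doc = new DOMDocument();
          $loaded = $doc->loadHTML($html);
          libxml_clear_errors();
          libxml_use_internal_errors($prev);
          if (!$loaded) {
              return [];
          }

          $xpath = new DOMXPath($doc);
          $titles = [];
          foreach ($xpath->query('//*[@id="ires"]//h3[@class="r"]') as $node) {
              $titles[] = trim($node->textContent);   // text only, nested tags dropped
          }
          return $titles;
      }
      ```

      Because the parser normalizes attributes, the query does not depend on whether the page writes class=r or class="r".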
      • Profile picture of the author Eager2SEO
        Originally Posted by Arbitbet View Post

        This is wrong.

        I would do it like this:
        preg_match('/<div id=ires><ol>.*<\/ol><\/div>/Usi', $m, $matches); // grab the ires div
        // before the next step, use var_dump($matches) to check what the previous call returned
        preg_match_all('/<h3 class="r">.*<\/h3>/Usi', $matches[0], $temp); // grab the h3 tags inside the ires div; preg_match_all because the div contains many h3 tags
        var_dump($temp[0]); // inspect the result

        Divide and conquer!

        Another way: you can use the DOM extension, see hxxp://php.net/manual/en/book.dom.php.

        For regular expressions, see hxxp://php.net/manual/en/book.pcre.php or the book "Mastering Regular Expressions" by Jeffrey Friedl.

        Regular-Expressions.info (Regex Tutorial, Examples and Reference) is the best website for regex. They truly deserve the backlink!

        Also, be careful with Google's HTML: some of the class and id attributes are not quoted.

        I use PHP Simple HTML DOM Parser to scrape web pages, but it fails on Google for that reason. I had to write a regex to add quotes to all unquoted attributes.

        This library works great once you master it.

        For geeks:
        HTML is not a regular language (its nested tags need at least a context-free grammar; see "Context-free grammar" on Wikipedia), so regex is a poor fit for parsing it. You should use DOM or XML tools if they are available.

        I do use regex for HTML when I work in .NET. You have to be careful about spacing and bad tags, though.
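        If you stay with regex, one alternative to rewriting the HTML is to let the pattern accept both quoted and unquoted attribute values. This is only a sketch, not the quote-adding regex described above; the function name h3_titles is hypothetical, and it assumes class is the first attribute of the h3 tag and that its value is exactly r:

        ```php
        <?php
        // Matches <h3 class=r>, <h3 class="r"> and <h3 class='r'>,
        // optionally followed by more attributes, and returns the inner HTML.
        function h3_titles(string $html): array
        {
            // (["\']?) captures an optional opening quote; \1 requires the same
            // quote (or nothing) after the value; the lookahead rejects class=rx.
            $pattern = '/<h3\s+class\s*=\s*(["\']?)r\1(?=[\s>])[^>]*>(.*?)<\/h3>/si';
            if (!preg_match_all($pattern, $html, $m)) {
                return [];
            }
            return $m[2];
        }
        ```

        The lazy (.*?) plays the same role as the U modifier used earlier in the thread: each match stops at the first closing h3 tag.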
        • Profile picture of the author banel
          Thanks again, Arbitbet. The script works. So when I use preg_match_all, do I have to use var_dump to output the results and nothing else? You used print before.
          • Profile picture of the author Arbitbet
            Originally Posted by banel View Post

            Thanks again, Arbitbet. The script works. So when I use preg_match_all, do I have to use var_dump to output the results and nothing else? You used print before.
            var_dump shows the type of the variable, the structure of the array, and its contents, so it gives more information than print. print only outputs a string; given an array such as $temp[0], it prints just "Array".
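            var_dump is not the only option, though. Since preg_match_all fills an array, you can join the matches into a string and print that. A small sketch with a made-up sample string and a hypothetical helper name (titles_as_text):

            ```php
            <?php
            // Strip the tags from each full match and join the titles, one per line.
            function titles_as_text(array $matches): string
            {
                return implode("\n", array_map('strip_tags', $matches));
            }

            // Made-up sample standing in for the ires div.
            $html = '<h3 class="r"><a href="/a">First</a></h3><h3 class="r"><a href="/b">Second</a></h3>';
            preg_match_all('/<h3 class="r">.*<\/h3>/Usi', $html, $temp);

            var_dump($temp[0]);               // debugging view: types, lengths, array structure
            print titles_as_text($temp[0]);   // plain output: one title per line
            ```

            Use var_dump while you are working out the pattern, and print (or a foreach loop) once you know what the array contains.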