Scraping websites - use PHP and Regexp or something else?

11 replies
Hi there,

I am currently busy learning PHP to help me make tools and other cool stuff for my websites. The first project I want to take on is a script that scrapes websites and turns the data into an affiliate site (all with permission from the vendors, of course).

Seems like it's pretty straightforward to get the HTML file, but next I would want to extract the data. The standard solution seems to be regular expressions, but I've also read suggestions not to use PHP for this at all and to use some Python library instead?

Next I would want to get the data onto my website. Would you need to store it in a MySQL database, or could you go straight from array to website?

I'm a newbie with PHP, though I do know programming basics. Anyway, is the process I outlined above the right way to do it? I don't want to be headed down the wrong path!
#php #regexp #scraping #websites
  • kokopelli
    I suggest you look at how some other scripts do it, e.g. CaRP Evolution | PHP RSS Parser / RSS to HTML Converter | Free Download

    And here's a simple parser script I sometimes use:
    Code:
    <?php
    set_time_limit(0);
    
    $file = "http://www.nytimes.com/services/xml/rss/nyt/RealEstate.xml";
    
    $rss_channel = array();
    $currently_writing = "";
    $main = "";
    $item_counter = 0;
    
    function startElement($parser, $name, $attrs) {
        global $rss_channel, $currently_writing, $main;
        switch($name) {
         case "RSS":
         case "RDF:RDF":
         case "ITEMS":
          $currently_writing = "";
          break;
         case "CHANNEL":
          $main = "CHANNEL";
          break;
         case "IMAGE":
          $main = "IMAGE";
          $rss_channel["IMAGE"] = array();
          break;
         case "ITEM":
          $main = "ITEMS";
          break;
         default:
          $currently_writing = $name;
          break;
        }
    }
    
    function endElement($parser, $name) {
        global $rss_channel, $currently_writing, $item_counter;
        $currently_writing = "";
        if ($name == "ITEM") {
         $item_counter++;
        }
    }
    
    function characterData($parser, $data) {
     global $rss_channel, $currently_writing, $main, $item_counter;
     if ($currently_writing != "") {
      switch($main) {
       case "CHANNEL":
        if (isset($rss_channel[$currently_writing])) {
         $rss_channel[$currently_writing] .= $data;
        } else {
         $rss_channel[$currently_writing] = $data;
        }
        break;
       case "IMAGE":
        if (isset($rss_channel[$main][$currently_writing])) {
         $rss_channel[$main][$currently_writing] .= $data;
        } else {
         $rss_channel[$main][$currently_writing] = $data;
        }
        break;
       case "ITEMS":
        if (isset($rss_channel[$main][$item_counter][$currently_writing])) {
         $rss_channel[$main][$item_counter][$currently_writing] .= $data;
        } else {
         $rss_channel[$main][$item_counter][$currently_writing] = $data;
        }
        break;
      }
     }
    }
    
    function curl_string ($url,$user_agent='Mozilla 4.0'){
    
    $ch = curl_init();
    
    curl_setopt ($ch, CURLOPT_URL, $url);
    curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt ($ch, CURLOPT_HEADER, 0);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt ($ch, CURLOPT_TIMEOUT, 120);
    $result = curl_exec ($ch);
    curl_close($ch);
    return $result;
    }
    
    $data=curl_string($file);
    $xml_parser = xml_parser_create();
    xml_set_element_handler($xml_parser, "startElement", "endElement");
    xml_set_character_data_handler($xml_parser, "characterData");
    
    if (!xml_parse($xml_parser, $data)) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
    xml_parser_free($xml_parser);
    
    if (isset($rss_channel["ITEMS"])) {
     if (count($rss_channel["ITEMS"]) > 0) {
      // never read past the last item if the feed has fewer than 5
      for ($i = 0; $i < min(5, count($rss_channel["ITEMS"])); $i++) {
       if (isset($rss_channel["ITEMS"][$i]["LINK"])) {
        print ("\n<div class=\"itemtitle\"><a rel=\"nofollow\" style=\"color:#000000;\" target=\"_blank\" href=\"" . $rss_channel["ITEMS"][$i]["LINK"] . "\">" . $rss_channel["ITEMS"][$i]["TITLE"] . "</a></div>");
       } else {
        print ("\n<div class=\"itemtitle\">" . $rss_channel["ITEMS"][$i]["TITLE"] . "</div>");
       }
       print ("<div class=\"itemdescription\">" . $rss_channel["ITEMS"][$i]["DESCRIPTION"] . "</div><br />");
      }
     } else {
      print ("No News Found");
     }
    }
    ?>
  • lordspace
Here's what you should do:

    download the RSS
    store it in MySQL
    and then display it.

    When storing it in the db, make sure you've set some of the fields to be unique (e.g. the post/article link) so you don't fill the db with duplicates.
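A minimal sketch of that dedup-by-unique-key idea, assuming a hypothetical `feed_items` table and PDO (the table and column names are placeholders, not from the thread):

```php
<?php
// Assumed schema (hypothetical names):
//   CREATE TABLE feed_items (
//       id          INT AUTO_INCREMENT PRIMARY KEY,
//       link        VARCHAR(255) NOT NULL UNIQUE,  -- the unique field
//       title       VARCHAR(255),
//       description TEXT
//   );

// Insert one scraped item. INSERT IGNORE silently skips rows whose
// link already exists, so re-running the scraper adds no duplicates.
function insert_item(PDO $pdo, array $item): bool
{
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO feed_items (link, title, description)
         VALUES (:link, :title, :description)'
    );
    $stmt->execute([
        ':link'        => $item['LINK'],
        ':title'       => $item['TITLE'] ?? '',
        ':description' => $item['DESCRIPTION'] ?? '',
    ]);
    return $stmt->rowCount() === 1; // false means it was a duplicate
}
```

The unique index does the deduplication in the database itself, so the PHP side stays a plain insert loop.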
  • mojojuju
I see RSS mentioned a couple of times, but I'm under the assumption that you want to scrape some HTML files.

    Originally Posted by JackPowers

    Hi there,
    Seems like it's pretty straightforward to get the HTML file, but next I would want to extract the data. The standard solution seems to be regular expressions, but I've also read suggestions not to use PHP for this at all and to use some Python library instead?
    There are easier ways than regular expressions. I used to use PHP Simple HTML DOM Parser, but there are even better options out there, some of which are listed here.

    Python is great for this sort of thing and beats PHP in many ways when it comes to good text-processing libraries, but PHP is perfectly capable of doing any kind of web scraping task you might need it to do.
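For comparison, PHP's bundled DOM extension can do this kind of extraction with no third-party library at all. A small sketch (the sample markup is invented for illustration):

```php
<?php
// Extract every link from a page using DOMDocument + DOMXPath.
// In practice $html would come from cURL or file_get_contents.
$html = '<html><body><a href="/a">First</a><a href="/b">Second</a></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate real-world malformed markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}
// $links maps each href to its anchor text
```

Because `loadHTML()` uses libxml's error-tolerant HTML parser, it copes with invalid markup much the way a browser does, which is exactly where hand-rolled regex tends to break.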

    Originally Posted by JackPowers

    Next I would want to get the data onto my website. Would you need to store it in a MySQL database, or could you go straight from array to website?
    I don't know enough about what you're doing to suggest anything specific. You could scrape some HTML and publish it on your site immediately, but in most circumstances I can think of, you'd probably want to store it first.
    • Brandon Tanner
      As far as storing the info is concerned: if you're new to PHP, you'll find it much easier to write the info to flat files (plain text files) than to learn all about MySQL and databases. When you first try to tackle MySQL, it can really make your head spin!
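A sketch of that flat-file approach, using JSON so the array structure survives the round trip (the filename is arbitrary):

```php
<?php
// Save a scraped items array to a plain text file as JSON...
$items = [
    ['TITLE' => 'First post',  'LINK' => 'http://example.com/1'],
    ['TITLE' => 'Second post', 'LINK' => 'http://example.com/2'],
];

$file = 'items.json';   // any writable path
file_put_contents($file, json_encode($items, JSON_PRETTY_PRINT));

// ...and read it back later, e.g. when rendering the page.
$restored = json_decode(file_get_contents($file), true);
```

No schema, no server, no credentials; the trade-off is that you lose querying, indexing, and the automatic duplicate protection a database unique key gives you.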
  • Nochek
    Ew. Ew. Ew.

    Do NOT Use Regex To Scrape HTML.

    If you absolutely have to use PHP, use the SimpleHTMLDOM framework and save yourself a lifetime of heartache. The internet as a whole is malformed and invalid; don't get caught up trying to write 12-line expressions just to get an href link.

    *Edit: I didn't see mojojuju's post above, which said exactly the same thing I did :p

    As an extra alternative, I personally use the HTMLAgilityPack and scrape things with C# applications, then feed them into my database.

    And while I can agree with Brandon that MySQL is difficult to get your mind around when you start, I would argue that in the long run, learning MySQL commands is just as hard as learning file_get_contents and all the various approaches to writing, parsing, and correctly identifying flat files, and in the end flat files will most likely be more expensive to work with.

    Taking the extra steps to learn how to do it correctly may make the process take longer, but in the end will make for a much better product.
    • phpg
      So why can't you use regex to scrape HTML? Just because someone on Stack Overflow says so (by the way, there are examples in that very thread that use regex)?

      In PHP, parsing HTML with regex is faster and less resource-intensive than any parser library that can handle not-well-formed HTML and recover from errors the way a web browser does. Especially if you have a predefined set of sites you'd like to parse and can write a "parser" for each site serving as a template.

      Of course, you have to master regular expressions first: Mastering Regular Expressions (O'Reilly Media).
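A sketch of that per-site "template" idea: one pattern written against a page layout you already know. The markup and field names here are invented for illustration:

```php
<?php
// Sample of a known page layout (in practice this comes from cURL).
$html = '
  <div class="product"><h2>Blue Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Red Widget</h2><span class="price">$14.50</span></div>
';

// One pattern per site, tuned to its exact markup. The "s" modifier
// lets "." match newlines; ".*?" keeps each match non-greedy.
$pattern = '#<div class="product"><h2>(.*?)</h2><span class="price">\$([\d.]+)</span></div>#s';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$products = [];
foreach ($matches as $m) {
    $products[$m[1]] = (float) $m[2];  // name => price
}
```

This is fast and simple precisely because it assumes the site's markup never changes; the moment the vendor redesigns the page, the template has to be rewritten, which is the usual argument for a DOM parser instead.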

      However, if you can use Python, it's much better for this kind of task, and with Python you don't need to use regex. There are several very good libraries for this, like Beautiful Soup.
  • mimin
    I always use regex and cURL, and a few sites can be handled via JSON.
  • lordspace
    I am also in favor of regular expressions, because the whole HTML file may not be 100% valid ... when parsing, one must look for exact tags.

    Ideally, if Jack can get in touch with the owners of the sites and they can add some HTML comments (see below), parsing would be relatively easy, and as long as the HTML comments stay in place, Jack's script will continue to work.

    Code:
    <!-- some_content -->
    <h2>Some title</h2>
    <div>Some content</div>
    <!-- /some_content -->
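Assuming the vendors do add markers like those, pulling out the block between them is a one-line preg_match (the marker name follows the example above; the surrounding markup is invented):

```php
<?php
$html = '
  <p>Navigation, ads, etc.</p>
  <!-- some_content -->
  <h2>Some title</h2>
  <div>Some content</div>
  <!-- /some_content -->
  <p>Footer junk.</p>
';

// Grab everything between the opening and closing marker comments.
// The "s" modifier lets "." span newlines; ".*?" stops at the first close.
$content = '';
if (preg_match('#<!-- some_content -->(.*?)<!-- /some_content -->#s', $html, $m)) {
    $content = trim($m[1]);
}
// $content now holds only the <h2> and <div> lines.
```

Because the markers are comments, they are invisible to visitors of the vendor's site, and the scraper stays immune to cosmetic redesigns as long as the markers survive.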
    P.S. With console scripts, I've always found that using a logger helps a lot when troubleshooting bugs. This could be a custom my_logger() function that appends to an existing file at certain points in the script.
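Such a my_logger() helper can be tiny (my_logger is the poster's own hypothetical name; the log path is arbitrary):

```php
<?php
// Append a timestamped line to a log file. Sprinkle calls at the key
// points of the script (after the fetch, after parsing, after the insert).
function my_logger(string $message, string $file = 'scraper.log'): void
{
    $line = date('Y-m-d H:i:s') . ' ' . $message . PHP_EOL;
    file_put_contents($file, $line, FILE_APPEND);
}

my_logger('fetched 20 items');   // e.g. right after the cURL call
```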
  • JayWiz
    The best combo:
    1. cURL for scraping. It's fast and can use proxies too; there are many ready-made cURL classes, and you only need to use one.
    2. Simple HTML DOM or regex for parsing and filtering out the results you want. You can then enter them into a database.

    Hope this helps.
    • IM Gourmet
      I'm very surprised no one's suggesting Ruby + Nokogiri here. It's incredibly powerful for website scraping.

      I just looked up one complex, multi-page scrape script I wrote, and it's 41 lines for the entire thing.
      • Lovelogic
        The Yahoo Pipes service is also often overlooked; though it's meant for making RSS feed mashups, it can double as a low-volume page scraper.

        Though the user-agent string is preset by Yahoo and cannot be altered, that's not a bad thing: when webmasters see it in their logs, coming from an IP range known to belong to Yahoo, they typically assume it's a bona fide search bot and let it through their defences.