Scraping websites - use PHP and Regexp or something else?

11 replies
Hi there,

I am currently busy learning PHP to help me make tools and other cool stuff for my websites. The first project I want to take on is a script that scrapes websites and turns the data into an affiliate site (all with permission from the vendors, of course).

Seems like it's pretty straightforward to get the HTML file, but next I would want to extract the data. The standard solution seems to be regular expressions, but I've also read suggestions not to use PHP for this at all and to use some Python library instead?

Next I would want to get the data onto my website. Would you need to store it in a MySQL database, or could you go straight from array to website?

I'm a newbie with PHP, though I do know programming basics. Anyway, is the process I outlined above the right way to do it? I don't want to be headed down the wrong path!
#php #regexp #scraping #websites
  • kokopelli
    I suggest you look at how some other scripts do it, e.g. CaRP Evolution | PHP RSS Parser / RSS to HTML Converter | Free Download

    And here's a simple parser script I sometimes use:
    Code:
    <?php
    set_time_limit(0);
    
    $file = "http://www.nytimes.com/services/xml/rss/nyt/RealEstate.xml";
    
    $rss_channel = array();
    $currently_writing = "";
    $main = "";
    $item_counter = 0;
    
    function startElement($parser, $name, $attrs) {
        global $rss_channel, $currently_writing, $main;
        switch($name) {
         case "RSS":
         case "RDF:RDF":
         case "ITEMS":
          $currently_writing = "";
          break;
         case "CHANNEL":
          $main = "CHANNEL";
          break;
         case "IMAGE":
          $main = "IMAGE";
          $rss_channel["IMAGE"] = array();
          break;
         case "ITEM":
          $main = "ITEMS";
          break;
         default:
          $currently_writing = $name;
          break;
        }
    }
    
    function endElement($parser, $name) {
        global $rss_channel, $currently_writing, $item_counter;
        $currently_writing = "";
        if ($name == "ITEM") {
         $item_counter++;
        }
    }
    
    function characterData($parser, $data) {
     global $rss_channel, $currently_writing, $main, $item_counter;
     if ($currently_writing != "") {
      switch($main) {
       case "CHANNEL":
        if (isset($rss_channel[$currently_writing])) {
         $rss_channel[$currently_writing] .= $data;
        } else {
         $rss_channel[$currently_writing] = $data;
        }
        break;
       case "IMAGE":
        if (isset($rss_channel[$main][$currently_writing])) {
         $rss_channel[$main][$currently_writing] .= $data;
        } else {
         $rss_channel[$main][$currently_writing] = $data;
        }
        break;
       case "ITEMS":
        if (isset($rss_channel[$main][$item_counter][$currently_writing])) {
         $rss_channel[$main][$item_counter][$currently_writing] .= $data;
        } else {
         $rss_channel[$main][$item_counter][$currently_writing] = $data;
        }
        break;
      }
     }
    }
    
    function curl_string ($url,$user_agent='Mozilla 4.0'){
    
    $ch = curl_init();
    
    curl_setopt ($ch, CURLOPT_URL, $url);
    curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt ($ch, CURLOPT_HEADER, 0);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt ($ch, CURLOPT_TIMEOUT, 120);
    $result = curl_exec ($ch);
    curl_close($ch);
    return $result;
    }
    
    $data=curl_string($file);
    $xml_parser = xml_parser_create();
    xml_set_element_handler($xml_parser, "startElement", "endElement");
    xml_set_character_data_handler($xml_parser, "characterData");
    
    if (!xml_parse($xml_parser, $data)) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
    xml_parser_free($xml_parser);
    
    if (isset($rss_channel["ITEMS"])) {
     if (count($rss_channel["ITEMS"]) > 0) {
      // never read past the last item if the feed has fewer than 5
      for ($i = 0; $i < min(5, count($rss_channel["ITEMS"])); $i++) {
       if (isset($rss_channel["ITEMS"][$i]["LINK"])) {
        print ("\n<div class=\"itemtitle\"><a rel=\"nofollow\" style=\"color:#000000;\" target=\"_blank\" href=\"" . $rss_channel["ITEMS"][$i]["LINK"] . "\">" . $rss_channel["ITEMS"][$i]["TITLE"] . "</a></div>");
       } else {
        print ("\n<div class=\"itemtitle\">" . $rss_channel["ITEMS"][$i]["TITLE"] . "</div>");
       }
       print ("<div class=\"itemdescription\">" . $rss_channel["ITEMS"][$i]["DESCRIPTION"] . "</div><br />");
      }
     } else {
      print ("No News Found");
     }
    }
    ?>
  • lordspace
Here's what you should do:

    download the RSS
    store it in MySQL
    and then display it.

    When storing it in the db, make sure you've set some of the fields to be unique (e.g. the post/article link) so you don't fill the db with duplicates.
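A minimal sketch of that dedup-by-unique-key idea, assuming a hypothetical `feed_items` table and PDO (the table and column names are placeholders, not from the thread):

```php
<?php
// Assumed schema (hypothetical names):
//   CREATE TABLE feed_items (
//       id          INT AUTO_INCREMENT PRIMARY KEY,
//       link        VARCHAR(255) NOT NULL UNIQUE,  -- the unique field
//       title       VARCHAR(255),
//       description TEXT
//   );

// Insert one scraped item. INSERT IGNORE silently skips rows whose
// link already exists, so re-running the scraper adds no duplicates.
function insert_item(PDO $pdo, array $item): bool
{
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO feed_items (link, title, description)
         VALUES (:link, :title, :description)'
    );
    $stmt->execute([
        ':link'        => $item['LINK'],
        ':title'       => $item['TITLE'] ?? '',
        ':description' => $item['DESCRIPTION'] ?? '',
    ]);
    return $stmt->rowCount() === 1; // false means it was a duplicate
}
```

The unique index does the deduplication in the database itself, so the PHP side stays a plain insert loop.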
  • mojojuju
I see RSS mentioned a couple of times, but I'm under the assumption that you want to scrape some HTML files.

    Originally Posted by JackPowers

    Hi there,
    Seems like it's pretty straightforward to get the HTML file, but next I would want to extract the data. The standard solution seems to be regular expressions, but I've also read suggestions not to use PHP for this at all and to use some Python library instead?
    There are easier ways than regular expressions. I used to use PHP Simple HTML DOM Parser, but there are even better options out there, some of which are listed here.

    Python is great for this sort of thing and beats PHP in many ways when it comes to good text-processing libraries, but PHP is perfectly capable of doing any kind of web scraping task you might need it to do.
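For comparison, PHP's bundled DOM extension can do this kind of extraction with no third-party library at all. A small sketch (the sample markup is invented for illustration):

```php
<?php
// Extract every link from a page using DOMDocument + DOMXPath.
// In practice $html would come from cURL or file_get_contents.
$html = '<html><body><a href="/a">First</a><a href="/b">Second</a></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate real-world malformed markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}
// $links maps each href to its anchor text
```

Because `loadHTML()` uses libxml's error-tolerant HTML parser, it copes with invalid markup much the way a browser does, which is exactly where hand-rolled regex tends to break.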

    Originally Posted by JackPowers

    Next I would want to get the data onto my website. Would you need to store it in a MySQL database, or could you go straight from array to website?
    I don't know enough about what you're doing to suggest anything specific. You could scrape some HTML and publish it on your site immediately, but in most circumstances I can think of, you'd probably want to store it first.
    • Brandon Tanner
      As far as storing the info is concerned: if you're new to PHP, you'll find it much easier to write the info to flat files (plain text files) than to learn all about MySQL and databases. When you first try to tackle MySQL, it can really make your head spin!
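A sketch of that flat-file approach, using JSON so the array structure survives the round trip (the filename is arbitrary):

```php
<?php
// Save a scraped items array to a plain text file as JSON...
$items = [
    ['TITLE' => 'First post',  'LINK' => 'http://example.com/1'],
    ['TITLE' => 'Second post', 'LINK' => 'http://example.com/2'],
];

$file = 'items.json';   // any writable path
file_put_contents($file, json_encode($items, JSON_PRETTY_PRINT));

// ...and read it back later, e.g. when rendering the page.
$restored = json_decode(file_get_contents($file), true);
```

No schema, no server, no credentials; the trade-off is that you lose querying, indexing, and the automatic duplicate protection a database unique key gives you.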
  • Nochek
    Ew. Ew. Ew.

    Do NOT Use Regex To Scrape HTML.

    If you absolutely have to use PHP, use the SimpleHTMLDOM framework and save yourself a lifetime of heartache. The internet as a whole is malformed and invalid; don't get caught up trying to write 12-line expressions just to get an href link.

    *Edit: I didn't see mojojuju's post above, which said exactly the same thing I did :p

    As an extra alternative, I personally use the HTMLAgilityPack and scrape things with C# applications, then feed them into my database.

    And while I can agree with Brandon that MySQL is difficult to get your mind around when you start, I would argue that in the long run, learning MySQL commands is just as hard as learning file_get_contents and all the various approaches to writing, parsing, and correctly identifying flat files, and in the end flat files will most likely be more expensive to work with.

    Taking the extra steps to learn how to do it correctly may make the process take longer, but in the end will make for a much better product.
    • phpg
      So why can't you use regex to scrape HTML? Just because someone on Stack Overflow says so (by the way, there are examples in that very thread that use regex)?

      In PHP, parsing HTML with regex is faster and less resource-intensive than any parser library that can handle not-well-formed HTML and recover from errors the way a web browser does. Especially if you have a predefined set of sites you'd like to parse and can write a "parser" for each site serving as a template.

      Of course, you have to master regular expressions first: Mastering Regular Expressions (O'Reilly Media).
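A sketch of that per-site "template" idea: one pattern written against a page layout you already know. The markup and field names here are invented for illustration:

```php
<?php
// Sample of a known page layout (in practice this comes from cURL).
$html = '
  <div class="product"><h2>Blue Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Red Widget</h2><span class="price">$14.50</span></div>
';

// One pattern per site, tuned to its exact markup. The "s" modifier
// lets "." match newlines; ".*?" keeps each match non-greedy.
$pattern = '#<div class="product"><h2>(.*?)</h2><span class="price">\$([\d.]+)</span></div>#s';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$products = [];
foreach ($matches as $m) {
    $products[$m[1]] = (float) $m[2];  // name => price
}
```

This is fast and simple precisely because it assumes the site's markup never changes; the moment the vendor redesigns the page, the template has to be rewritten, which is the usual argument for a DOM parser instead.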

      However, if you can use Python, it's much better for this kind of task, and with Python you don't need to use regex. There are several very good libraries for this, like Beautiful Soup.
  • mimin
    I always use regex and cURL, and a few sites can be handled via JSON.
  • lordspace
    I am also in favor of regular expressions, because the whole HTML file may not be 100% valid ... when parsing, one must look for exact tags.

    Ideally, if Jack can get in touch with the owners of the sites and they can add some HTML comments (see below), parsing would be relatively easy, and as long as the HTML comments stay in place, Jack's script will continue to work.

    Code:
    <!-- some_content -->
    <h2>Some title</h2>
    <div>Some content</div>
    <!-- /some_content -->
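Assuming the vendors do add markers like those, pulling out the block between them is a one-line preg_match (the marker name follows the example above; the surrounding markup is invented):

```php
<?php
$html = '
  <p>Navigation, ads, etc.</p>
  <!-- some_content -->
  <h2>Some title</h2>
  <div>Some content</div>
  <!-- /some_content -->
  <p>Footer junk.</p>
';

// Grab everything between the opening and closing marker comments.
// The "s" modifier lets "." span newlines; ".*?" stops at the first close.
$content = '';
if (preg_match('#<!-- some_content -->(.*?)<!-- /some_content -->#s', $html, $m)) {
    $content = trim($m[1]);
}
// $content now holds only the <h2> and <div> lines.
```

Because the markers are comments, they are invisible to visitors of the vendor's site, and the scraper stays immune to cosmetic redesigns as long as the markers survive.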
    P.S. With console scripts, I've always found that using a logger helps a lot when troubleshooting bugs. This could be a custom my_logger() function that appends to an existing file at certain points in the script.
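Such a my_logger() helper can be tiny (my_logger is the poster's own hypothetical name; the log path is arbitrary):

```php
<?php
// Append a timestamped line to a log file. Sprinkle calls at the key
// points of the script (after the fetch, after parsing, after the insert).
function my_logger(string $message, string $file = 'scraper.log'): void
{
    $line = date('Y-m-d H:i:s') . ' ' . $message . PHP_EOL;
    file_put_contents($file, $line, FILE_APPEND);
}

my_logger('fetched 20 items');   // e.g. right after the cURL call
```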
  • JayWiz
    The best combo:
    1. cURL for scraping. It's fast and can use proxies too; there are many ready-made cURL classes, and you only need to use one.
    2. Simple HTML DOM or regex for parsing and filtering out the results you want. You can then enter them into a database.

    Hope this helps.
    • IM Gourmet
      I'm very surprised no one's suggesting Ruby + Nokogiri here. It's incredibly powerful for website scraping.

      I just looked up one complex, multi-page scrape script I wrote, and it's 41 lines for the entire thing.
      • Lovelogic
        The Yahoo Pipes service is also often overlooked; though it's meant for making RSS feed mashups, it can double as a low-volume page scraper.

        Though the user-agent string is preset by Yahoo and cannot be altered, that's not a bad thing: when webmasters see it in their logs, coming from an IP range known to belong to Yahoo, they typically assume it's a bona fide search bot and let it through their defences.