PHP SCRAPING - INCONSISTENCIES

4 replies
Quick question on something that's baffling me.

I have one of my scrapers set up to pull a couple thousand pieces of data from a site. Out of those, a few sometimes come back missing or blank, which is inconsistent with the rest of the lot.

As an example, let's say I'm scraping a site that displays baseball player stats. My returns would be...

Mickey Mantle - NY Yankees - .357
Wade Boggs - Boston Red Sox - .385
((BLANK)) - LA Dodgers - .297
Mookie Wilson - NY Mets - .234

I've gone back to check the page source and there is nothing different that would create an instance where my parsing would cause the blank to occur.

I've also rerun the exact page through a testing scraper and it picks up the pieces just fine.

So I guess I'm going a long way to ask: is PHP prone to slight irregularities or inconsistencies when dealing with many pieces of data?
#inconsistencies #php #scraping
  • Tashi Mortier
    How do you extract this data? Do you use regular expressions? Some HTML cleaning libraries?

    Maybe there are accents in some players' names that your regular expressions don't cover.

    I've written some PHP scrapers myself in the past, so yes, if your expressions don't cover all possible cases, stuff like this can happen.
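
A minimal sketch of how an accented name could come back blank, assuming the scraper uses an ASCII-only character class. The row format is modelled on the example in the question; both patterns, the `extractName` helper, and the accented row are illustrative assumptions, not the original poster's code.

```php
<?php
// Rows in the "Name - Team - Average" shape used in the question.
$rows = [
    'Mickey Mantle - NY Yankees - .357',
    "Jos\u{00E9} Reyes - NY Mets - .337", // illustrative accented row, not from the thread
];

// ASCII-only class: stops at the first accented letter, so the match fails.
$asciiPattern = '/^([A-Za-z .\']+) - ([A-Za-z .\']+) - (\.\d{3})$/';

// \p{L} matches any Unicode letter; the /u modifier makes PCRE treat
// both the pattern and the subject as UTF-8.
$unicodePattern = '/^([\p{L} .\'-]+?) - ([\p{L} .\'-]+?) - (\.\d{3})$/u';

// Hypothetical helper: returns the captured name, or '' when nothing matched.
function extractName(string $pattern, string $row): string
{
    return preg_match($pattern, $row, $m) === 1 ? $m[1] : '';
}

foreach ($rows as $row) {
    printf(
        "ascii=[%s] unicode=[%s]\n",
        extractName($asciiPattern, $row),
        extractName($unicodePattern, $row)
    );
}
```

With the ASCII-only pattern the second row yields an empty name, while the `\p{L}` version captures it. Note that `/u` only helps if the input really is valid UTF-8.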
  • SebastianJ
    I see a few possibilities here:

    1. Encoding issues - maybe the page is in ISO-8859-1 (or something similar) while your parser expects UTF-8, so the input ends up garbled.

    2. Regex issues (as Tashi hinted) - your regexes might fail on certain input strings because a) the regex isn't written to handle UTF-8/ISO/whatever strings, or b) it doesn't account for special characters (accents etc.).

    3. Connection issues (not that likely) - maybe the connection drops and only a partial response is returned from the page you're scraping.

    4. Blocked by site - the site owner might have a system set up to stop people from excessively scraping the site (EzineArticles does this, for example). The first 3 responses are returned correctly, but the last one comes back partially scraped or completely garbled.

    Are you using proxies by the way?
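
A rough sketch of the encoding point (item 1), assuming the `mbstring` extension is available. The `toUtf8` helper and its ISO-8859-1 fallback are assumptions for illustration; a real scraper would do better to read the charset from the `Content-Type` header or the page's `<meta>` tag.

```php
<?php
// Hypothetical helper: return the page as UTF-8.
// Assumption: anything that is not valid UTF-8 is in $fallbackCharset.
function toUtf8(string $html, string $fallbackCharset = 'ISO-8859-1'): string
{
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($html, 'UTF-8', $fallbackCharset);
}

$latin1 = "Jos\xE9 Reyes"; // "é" as a single ISO-8859-1 byte

// With the /u modifier, preg_match() returns false (not 0) on invalid UTF-8,
// which an unchecked scraper can easily turn into a blank field.
$before = preg_match('/^[\p{L} ]+$/u', $latin1);
if ($before === false && preg_last_error() === PREG_BAD_UTF8_ERROR) {
    echo "subject is not valid UTF-8\n";
}

// After conversion the same pattern matches.
$after = preg_match('/^[\p{L} ]+$/u', toUtf8($latin1));
var_dump($before, $after);
```

Checking `preg_last_error()` after a failed match is a cheap way to tell "pattern did not match" apart from "input was not valid UTF-8".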
  • Big Squid
    Yes, I'm using proxies. I've tried it with a proxy and without. Same problem. Interestingly, I transferred the files to another VPS as a test and it worked perfectly. Perhaps it was the server I was using?
  • KirkMcD
    How are you debugging this? Are you saving all the downloaded pages and then examining/retrying the pages that didn't parse properly or are you just rerunning the script and redownloading the problem page?
    You need to save the pages and examine the ones that didn't work to see what's different.
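
One way the "save the pages" advice might look in practice, as a sketch only. The `savePage` and `blankFields` helpers, the directory, the URL, and the record layout are all hypothetical; the idea is to keep the exact bytes that were parsed so a failed record can be re-examined offline instead of re-downloaded.

```php
<?php
// Hypothetical helper: write the raw response to disk, named by a hash of the URL.
function savePage(string $dir, string $url, string $body): string
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    $path = $dir . '/' . sha1($url) . '.html';
    file_put_contents($path, $body);
    return $path;
}

// Hypothetical helper: names of the fields that came back empty.
// $record is whatever associative array your own parser produces.
function blankFields(array $record): array
{
    return array_keys(array_filter($record, fn($v) => trim((string) $v) === ''));
}

// Example values standing in for a real download and parse.
$url  = 'http://example.com/player/3';
$body = '<html>...</html>';
$path = savePage(sys_get_temp_dir() . '/scrape_debug', $url, $body);

$record  = ['name' => '', 'team' => 'LA Dodgers', 'avg' => '.297'];
$missing = blankFields($record);

if ($missing !== []) {
    // Log the URL, byte count, saved file, and which fields were blank.
    $logLine = sprintf(
        "%s bytes=%d file=%s blank=%s",
        $url,
        strlen($body),
        $path,
        implode(',', $missing)
    );
    echo $logLine, "\n";
}
```

Recording the byte count alongside the saved file also makes truncated responses easier to spot when comparing a failed page with a good one.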