PHP SCRAPING - INCONSISTENCIES

by Big Squid

Posted: 14 years ago 4 replies

PROGRAMMING

Quick question on something that's baffling me..

I have one of my scrapers set up to scrape a site for thousands of pieces of data. Out of a couple thousand records, a few sometimes come back missing or blank, which is inconsistent with the rest of the lot.

As an example, let's say I'm scraping a site that displays baseball player stats. My returns would be...

Mickey Mantle - NY Yankees - .357
Wade Boggs - Boston Red Sox - .385
((BLANK)) - LA Dodgers - .297
Mookie Wilson - NY Mets - .234

I've gone back to check the page source and there is nothing different that would create an instance where my parsing would cause the blank to occur.

I've also rerun the exact page through a testing scraper and it picks up the pieces just fine.

So I guess I'm going a long way to ask if PHP is accustomed to having slight irregularities or inconsistencies when dealing with multiple pieces of data?

#inconsistencies #php #scraping

Tashi Mortier 14 years ago

How do you extract this data? Do you use regular expressions? Some HTML cleaning libraries?

Maybe there are some accents in players' names that aren't being covered by your regular expressions.

I've written some PHP scrapers myself in the past, so, yes, if some of your expressions don't cover all possible cases, stuff like this can happen.
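
To illustrate the accent problem, here is a minimal sketch. The markup, the class name `player`, and the sample names are made up for the example, since the real site's HTML isn't shown in this thread, and it assumes the script file is saved as UTF-8. An ASCII-only character class silently skips a name containing an accented letter, while `\p{L}` with the `/u` modifier matches it.

```php
<?php
// Hypothetical markup; the real page structure is an assumption.
$html = '<td class="player">José Peña</td><td class="player">Wade Boggs</td>';

// ASCII-only class: a name with an accented letter fails to match,
// so that record is dropped without any error or warning.
preg_match_all('/<td class="player">([A-Za-z .\'-]+)<\/td>/', $html, $ascii);

// \p{L} matches any Unicode letter; the /u modifier tells PCRE to treat
// both the pattern and the subject as UTF-8.
preg_match_all('/<td class="player">([\p{L} .\'-]+)<\/td>/u', $html, $unicode);

print_r($ascii[1]);   // only the unaccented name
print_r($unicode[1]); // both names
```

Note that `preg_match_all` with `/u` returns `false` if the subject is not valid UTF-8, so it is worth checking the return value as well.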
SebastianJ 14 years ago

I see a few possibilities here:

1. Encoding issues - maybe the page is in ISO-8859-1 (or something similar) and your parser expects UTF-8 and as a result, somehow the input is garbled.

2. Regex issues (as Tashi hinted) - Your regexes might not be able to parse a certain input string because a) the regex isn't written to handle UTF-8/ISO/whatever strings properly or b) it doesn't account for special characters (accents etc.)

3. Connection issues (not that likely) - Maybe the connection drops somehow and only the partial response is returned from the page you're scraping.

4. Blocked by site - The site owner might have a system set up to stop people from excessively scraping the site (like e.g. EzineArticles). The first 3 responses are returned correctly but the last one returns partially scraped or completely garbled data.
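
For the encoding case in point 1, one option is to normalize every downloaded page to UTF-8 before parsing. The sketch below is only an illustration: `toUtf8` is a hypothetical helper, the ISO-8859-1 fallback is a guess that should be replaced by whatever the page's Content-Type header or meta charset declares, and it requires PHP's mbstring extension.

```php
<?php
// Hypothetical helper: return the page body as UTF-8.
function toUtf8(string $body, string $fallback = 'ISO-8859-1'): string
{
    // Bytes that already form valid UTF-8 are passed through unchanged.
    if (mb_check_encoding($body, 'UTF-8')) {
        return $body;
    }
    // Otherwise assume the fallback charset (an assumption; prefer the
    // charset the server or the page itself declares).
    return mb_convert_encoding($body, 'UTF-8', $fallback);
}

$latin1 = "Jos\xE9 Pe\xF1a"; // "José Peña" encoded as ISO-8859-1
$utf8   = toUtf8($latin1);

var_dump(mb_check_encoding($latin1, 'UTF-8')); // false: not valid UTF-8
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // true after conversion
```

Running the parser on the normalized string means the regexes only ever have to deal with one encoding.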

Are you using proxies by the way?
Big Squid 14 years ago

Yes, I'm using proxies. I've tried it with a proxy and without. Same problem. Interestingly, I transferred the files to another VPS, tested there, and it worked perfectly. Perhaps it was the server I was using?
KirkMcD 14 years ago

How are you debugging this? Are you saving all the downloaded pages and then examining/retrying the pages that didn't parse properly or are you just rerunning the script and redownloading the problem page?
You need to save the pages and examine the ones that didn't work to see what's different.
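
A minimal sketch of that approach, assuming the scraper produces an array of parsed fields per page: whenever a field comes back blank, the exact HTML that was parsed is written to disk so it can be inspected later. The function name `saveIfIncomplete`, the directory, and the example URL are all hypothetical.

```php
<?php
// Hypothetical helper: if any parsed field is blank, save the exact HTML
// that was parsed and return the file path; otherwise return null.
function saveIfIncomplete(string $url, string $html, array $fields, string $dir): ?string
{
    foreach ($fields as $name => $value) {
        if (trim((string) $value) === '') {
            if (!is_dir($dir)) {
                mkdir($dir, 0777, true);
            }
            // md5 of the URL gives a filesystem-safe, repeatable file name.
            $path = $dir . '/' . md5($url) . '.html';
            file_put_contents($path, $html);
            // The byte count helps spot truncated or blocked responses.
            error_log("Blank field '$name' for $url, saved " . strlen($html) . " bytes to $path");
            return $path;
        }
    }
    return null; // every field was filled in
}

$dir   = sys_get_temp_dir() . '/scrape-failures';
$row   = ['name' => '', 'team' => 'LA Dodgers', 'avg' => '.297'];
$saved = saveIfIncomplete('http://example.com/player/3', '<html><body>partial page</body></html>', $row, $dir);
```

Comparing a saved file with a fresh download of the same URL shows whether the problem is in the response (truncated, blocked, different encoding) or in the parsing.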