Uncovering discrepancies between Google's indexed pages and a site's actual pages?
The task involves scraping, or otherwise getting, a list of all pages indexed by Google for a site, then comparing that list of URLs to the actual page URLs at the site, so that any discrepancies between the two can be found and corrected with 301 redirects or the like.
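The comparison itself, once both lists exist, is just a set difference. A hypothetical sketch (the URLs here are made up for illustration):

```python
# Sketch: given the two URL lists described above, find which indexed
# URLs no longer exist on the site (301-redirect candidates) and which
# live pages Google has not indexed yet.

def diff_urls(indexed, actual):
    """Return (gone, unindexed) as sorted lists."""
    indexed_set, actual_set = set(indexed), set(actual)
    gone = sorted(indexed_set - actual_set)       # indexed but missing: 301 these
    unindexed = sorted(actual_set - indexed_set)  # live but not in the index
    return gone, unindexed

# Example with made-up URLs:
gone, unindexed = diff_urls(
    ["http://example.com/old-page", "http://example.com/about"],
    ["http://example.com/about", "http://example.com/new-page"],
)
```

Normalizing both lists first (trailing slashes, http vs https, www vs bare domain) would matter in practice, since the same page can appear under several URL spellings.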
As to the first part, here is what seems best:
- Create a PHP script to scrape the URLs from one SERP results page.
- Go to Google and, using the site: operator, get it to return all pages from the site that it has in its index (100 results at a time).
- Copy each such SERP page of 100 results, one at a time, into a file in a directory until I have gone through all of Google's results for the site.
- Run the script on all files in that directory to scrape each copied page and spit out a CSV file containing all indexed URLs.
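The scrape-and-CSV steps above could look something like this (a rough sketch in Python rather than PHP). Google's SERP markup changes often, so the filter here, which keeps only absolute anchors pointing at the target domain, is an assumption you would need to adapt; `SITE` and the file paths are placeholders:

```python
# Sketch: parse saved SERP HTML files, pull out result links for one
# domain, and write the deduplicated list to a CSV file.
import csv
import glob
from html.parser import HTMLParser

SITE = "example.com"  # hypothetical target domain

class LinkCollector(HTMLParser):
    """Collect hrefs of <a> tags that look like results for SITE."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if SITE in href and href.startswith("http"):
                self.urls.append(href)

def scrape_dir(pattern="serps/*.html", out="indexed.csv"):
    """Scrape every saved SERP file and write one URL per CSV row."""
    seen = []
    for path in sorted(glob.glob(pattern)):
        parser = LinkCollector()
        with open(path, encoding="utf-8", errors="ignore") as f:
            parser.feed(f.read())
        for url in parser.urls:
            if url not in seen:  # preserve order, drop duplicates
                seen.append(url)
    with open(out, "w", newline="") as f:
        csv.writer(f).writerows([u] for u in seen)
    return seen
```

The same structure translates directly to PHP with `glob()`, a DOM parser, and `fputcsv()`.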
For the second part, it would seem that I need to run a web crawler on the site to get a list of its current pages.
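A minimal same-domain crawler for that second part might be sketched like this (the start URL is hypothetical; a real run should also respect robots.txt, throttle requests, and account for redirects and canonical tags):

```python
# Sketch: breadth-first crawl of one domain, returning the page URLs found.
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect raw href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(start, limit=500):
    """Breadth-first crawl, staying on start's domain, up to `limit` pages."""
    domain = urlparse(start).netloc
    queue, seen, pages = [start], {start}, []
    while queue and len(pages) < limit:
        url = queue.pop(0)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to fetch or decode
        pages.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and strip #fragments before queueing.
            absolute = urldefrag(urljoin(url, href))[0]
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Usage (hypothetical): crawl("http://example.com/")
```

The output of `crawl()` is the "actual pages" list to diff against the Google-indexed list.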
Anybody know of a better way to do this?
Carlos
Owner of:
[Fresh Store Builder]
The world's most advanced Amazon store builder, with over 17,000 members.