Uncovering discrepancies between Google's indexed pages and a site's actual pages?

4 replies
I have a client who has handed me quite the doozy of a task, and I'm not altogether sure of the best way to tackle it, so I thought I would ask here for any suggestions.

The task involves scraping (or otherwise obtaining) a list of all pages Google has indexed for a site and comparing that list of URLs to the actual page URLs at the site, so that any discrepancies between the two can be identified and corrected with 301 redirects or the like.
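
The comparison itself should be the easy part once both lists exist. Something like this is what I have in mind (file names are placeholders, and I'm assuming one normalized URL per line in each file):

<?php
// Compare Google's indexed URLs against the site's actual URLs.
$indexed = array_map('trim', file('google_indexed.txt'));
$actual  = array_map('trim', file('site_pages.txt'));

// Indexed by Google but gone from the site: candidates for 301 redirects.
$stale = array_diff($indexed, $actual);

// Live on the site but not indexed by Google.
$missing = array_diff($actual, $indexed);

echo "Indexed but not on site (need 301s):\n", implode("\n", $stale), "\n";
echo "On site but not indexed:\n", implode("\n", $missing), "\n";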

As to the first part...here is what seems best.

- Write a PHP script to scrape the result URLs from a single SERP page (a rough sketch follows this list).
- Go to Google and, using the site: operator, get it to return every page it has in its index for the site (100 results at a time).
- Save each such 100-result SERP page, one at a time, to a file in a directory until I have gone through all of Google's results for the site.
- Run the script over every file in that directory to scrape each saved page and spit out a CSV of all indexed pages.
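
Here is the kind of scraping script I have in mind. The regex is a guess and will almost certainly need tweaking for Google's current markup, and the file names are just placeholders:

<?php
// Pull result URLs out of saved Google SERP pages (one HTML file per
// 100-result page) and collect them all into one CSV.
$out = fopen('indexed_pages.csv', 'w');

foreach (glob('serps/*.html') as $file) {
    $html = file_get_contents($file);

    // Result links sit inside <h3> headings in Google's markup; this
    // pattern is a guess and will need adjusting to the live layout.
    preg_match_all('~<h3[^>]*>\s*<a href="(https?://[^"]+)"~i', $html, $matches);

    foreach (array_unique($matches[1]) as $url) {
        fputcsv($out, array($url));
    }
}
fclose($out);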

For the second part...it would seem that I need to run a web crawler on the site to get a list of its present pages.
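
Something like this minimal crawler sketch is what I'm picturing, assuming the site is small enough to walk in one pass (the start URL is a placeholder, and the relative-link handling is deliberately crude):

<?php
// Minimal same-site crawler: follow internal links and record every
// page URL found. No politeness delays, no robots.txt handling, etc.
$start = 'http://www.example.com/';
$host  = parse_url($start, PHP_URL_HOST);

$queue = array($start);
$seen  = array($start => true);

while ($queue) {
    $url  = array_shift($queue);
    $html = @file_get_contents($url);
    if ($html === false) continue;

    preg_match_all('~<a[^>]+href="([^"]+)"~i', $html, $matches);
    foreach ($matches[1] as $link) {
        $link = strtok($link, '#');          // drop any fragment
        if (strpos($link, 'http') !== 0) {   // crude relative-link fix
            $link = rtrim($start, '/') . '/' . ltrim($link, '/');
        }
        if (parse_url($link, PHP_URL_HOST) === $host && !isset($seen[$link])) {
            $seen[$link] = true;
            $queue[]     = $link;
        }
    }
}

file_put_contents('site_pages.txt', implode("\n", array_keys($seen)));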

Anybody know of a better way to do this?

Carlos
  • careybaird
    You might be able to get the information from Google Webmaster Tools. Scraping Google's SERPs will be tough, and you will have to consider different IPs, maybe even the API.

    In terms of the actual site - is it database driven? WordPress? If so, you should be able to get the site's page list programmatically. If an XML sitemap is generated, you could also use that.
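
    For example, if a sitemap.xml exists, a few lines will pull the URL list out of it (the sitemap URL here is just an example):

    <?php
    // Read every <loc> entry out of a standard XML sitemap.
    $xml = simplexml_load_file('http://www.example.com/sitemap.xml');

    $urls = array();
    foreach ($xml->url as $entry) {
        $urls[] = (string) $entry->loc;
    }

    file_put_contents('site_pages.txt', implode("\n", $urls));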

    Looking at the bigger picture... an XML sitemap is the best way to ensure that Google is indexing the correct URLs of your website.
    Signature

    Owner of: Fresh Store Builder

    The world's most advanced Amazon store builder with over 17,000 members.

    • carlos123
      Thanks for the input!

      I'll look into some of what you suggested.

      Carlos
  • pbarnhart
    Three words for dealing with your own site: Xenu Link Sleuth!

    As for the pages indexed by Google - here is a simple hack that involves using the Google site: operator and simply saving the resulting pages:

    List Google Pages Indexed for SEO: Two Step How To | End Point Blog

    You may need to adjust the code a bit to account for Google's modified layout.
    • carlos123
      Thanks pbarnhart. I found that page in my own Googling yesterday. It's definitely along the lines of what I am thinking, though I will use PHP instead of sed.

      By the way, the comments on that page are important to read: the script described there apparently trips Google's automated-traffic detection and can get your IP temporarily banned from accessing Google.
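
      If I end up automating the fetch step as well, I'll probably pause between requests, along these lines (the query parameters and the delay range are guesses on my part, not known-safe values):

      <?php
      // Fetch site: results page by page, pausing between requests so the
      // script is less likely to trip Google's automated-traffic detection.
      // Assumes the serps/ directory already exists.
      for ($start = 0; $start < 1000; $start += 100) {
          $url  = 'http://www.google.com/search?q=site:example.com&num=100&start=' . $start;
          $html = @file_get_contents($url);
          if ($html === false) break;    // blocked, or out of results

          file_put_contents("serps/page_$start.html", $html);
          sleep(rand(30, 90));           // random pause between fetches
      }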

      Oh...I run Linux, so Xenu isn't very practical for me since it only runs under Windows. I could run it under Wine, and I may do that sometime, but Wine eats up a lot of resources on my laptop and I'd rather not.

      Carlos
