Uncovering discrepancies between Google's indexed pages and a site's actual pages?

4 replies
I have a client who has handed me quite the doozy of a task, and I'm not altogether sure of the best way to tackle it, so I thought I would ask here for any suggestions.

The task involves scraping (or otherwise obtaining) a list of all pages Google has indexed for a site and comparing that list of URLs to the actual page URLs at the site, so that any discrepancies between the two can be identified and corrected with 301 redirects or the like.
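
The comparison itself should be the easy part once both lists exist. Something like this is what I have in mind (file names are placeholders, and I'm assuming one normalized URL per line in each file):

<?php
// Compare Google's indexed URLs against the site's actual URLs.
$indexed = array_map('trim', file('google_indexed.txt'));
$actual  = array_map('trim', file('site_pages.txt'));

// Indexed by Google but gone from the site: candidates for 301 redirects.
$stale = array_diff($indexed, $actual);

// Live on the site but not indexed by Google.
$missing = array_diff($actual, $indexed);

echo "Indexed but not on site (need 301s):\n", implode("\n", $stale), "\n";
echo "On site but not indexed:\n", implode("\n", $missing), "\n";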

As to the first part...here is what seems best.

- Write a PHP script to scrape the result URLs from a single SERP page (a rough sketch follows this list).
- Go to Google and, using the site: operator, get it to return every page it has in its index for the site (100 results at a time).
- Save each such 100-result SERP page, one at a time, to a file in a directory until I have gone through all of Google's results for the site.
- Run the script over every file in that directory to scrape each saved page and spit out a CSV of all indexed pages.
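
Here is the kind of scraping script I have in mind. The regex is a guess and will almost certainly need tweaking for Google's current markup, and the file names are just placeholders:

<?php
// Pull result URLs out of saved Google SERP pages (one HTML file per
// 100-result page) and collect them all into one CSV.
$out = fopen('indexed_pages.csv', 'w');

foreach (glob('serps/*.html') as $file) {
    $html = file_get_contents($file);

    // Result links sit inside <h3> headings in Google's markup; this
    // pattern is a guess and will need adjusting to the live layout.
    preg_match_all('~<h3[^>]*>\s*<a href="(https?://[^"]+)"~i', $html, $matches);

    foreach (array_unique($matches[1]) as $url) {
        fputcsv($out, array($url));
    }
}
fclose($out);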

For the second part...it would seem that I need to run a web crawler on the site to get a list of its present pages.
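
Something like this minimal crawler sketch is what I'm picturing, assuming the site is small enough to walk in one pass (the start URL is a placeholder, and the relative-link handling is deliberately crude):

<?php
// Minimal same-site crawler: follow internal links and record every
// page URL found. No politeness delays, no robots.txt handling, etc.
$start = 'http://www.example.com/';
$host  = parse_url($start, PHP_URL_HOST);

$queue = array($start);
$seen  = array($start => true);

while ($queue) {
    $url  = array_shift($queue);
    $html = @file_get_contents($url);
    if ($html === false) continue;

    preg_match_all('~<a[^>]+href="([^"]+)"~i', $html, $matches);
    foreach ($matches[1] as $link) {
        $link = strtok($link, '#');          // drop any fragment
        if (strpos($link, 'http') !== 0) {   // crude relative-link fix
            $link = rtrim($start, '/') . '/' . ltrim($link, '/');
        }
        if (parse_url($link, PHP_URL_HOST) === $host && !isset($seen[$link])) {
            $seen[$link] = true;
            $queue[]     = $link;
        }
    }
}

file_put_contents('site_pages.txt', implode("\n", array_keys($seen)));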

Anybody know of a better way to do this?

Carlos
  • careybaird
    You might be able to get the information from Google Webmaster Tools. Scraping Google's SERPs will be tough, and you will have to consider different IPs, maybe even the API.

    In terms of the actual site - is it database driven? WordPress? If so, you should be able to get the site's page list programmatically. If an XML sitemap is generated, you could also use that.
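
    For example, if a sitemap.xml exists, a few lines will pull the URL list out of it (the sitemap URL here is just an example):

    <?php
    // Read every <loc> entry out of a standard XML sitemap.
    $xml = simplexml_load_file('http://www.example.com/sitemap.xml');

    $urls = array();
    foreach ($xml->url as $entry) {
        $urls[] = (string) $entry->loc;
    }

    file_put_contents('site_pages.txt', implode("\n", $urls));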

    Looking at the bigger picture... an XML sitemap is the best way to ensure that Google is indexing the correct URLs of your website.
    Signature

    Owner of: Fresh Store Builder

    The world's most advanced Amazon store builder with over 17,000 members.

    • carlos123
      Thanks for the input!

      I'll look into some of what you suggested.

      Carlos
  • pbarnhart
    Three words for dealing with your own site: Xenu Link Sleuth!

    As for the pages indexed by Google - here is a simple hack that involves using the Google site: operator and simply saving the resulting pages:

    List Google Pages Indexed for SEO: Two Step How To | End Point Blog

    You may need to adjust the code a bit to account for Google's modified layout.
    • carlos123
      Thanks pbarnhart. I found that page in my own Googling yesterday. It's definitely along the lines of what I am thinking, though I will use PHP instead of sed.

      By the way, the comments on that page are important to read: the script described there apparently trips Google's automated-traffic detection and can get your IP temporarily banned from accessing Google.
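
      If I end up automating the fetch step as well, I'll probably pause between requests, along these lines (the query parameters and the delay range are guesses on my part, not known-safe values):

      <?php
      // Fetch site: results page by page, pausing between requests so the
      // script is less likely to trip Google's automated-traffic detection.
      // Assumes the serps/ directory already exists.
      for ($start = 0; $start < 1000; $start += 100) {
          $url  = 'http://www.google.com/search?q=site:example.com&num=100&start=' . $start;
          $html = @file_get_contents($url);
          if ($html === false) break;    // blocked, or out of results

          file_put_contents("serps/page_$start.html", $html);
          sleep(rand(30, 90));           // random pause between fetches
      }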

      Oh...I run Linux, so Xenu isn't very practical for me since it only runs under Windows. I could run it under Wine, and I may do that sometime, but Wine eats up a lot of resources on my laptop and I'd rather not.

      Carlos
