Verify Google Web Cache in PHP

by tpw
6 replies
Does anyone have any ideas on how I would go about retrieving the Google cache of a web page in PHP?

I don't need the actual content. I am simply looking to verify if a particular page has been cached by Google.

A simple 1 or 0 is my desired output.




p.s. The last time I tried to curl a Google web page, a few years ago, my attempt was blocked. I assume this remains the case.

p.p.s. I do have a Google API Key, but cannot see any reference to cache in their documentation.
#cache #google #php #web
  • mojojuju
    Did you set a user-agent string other than "Curl" when you downloaded from Google with cURL? If not, give it a shot. I'm pretty sure Google will flat-out deny anything with a UA string like "Curl" or "Wget".

    I was using a Google-scraping script a couple of months ago just fine, as long as I set a normal-looking user agent. I also ran it through a proxy connected to Tor.
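The user-agent fix above can be sketched in PHP's cURL extension. This is a minimal example, not anyone's production code; the function name and the idea of passing the UA in as a parameter are my assumptions.

```php
<?php
// Sketch: fetch a URL while sending a browser-like user agent,
// instead of cURL's default UA, which Google is likely to reject.
function fetch_with_ua($url, $ua) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $ua);       // browser-like UA string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
    $body = curl_exec($ch);
    curl_close($ch);
    return $body; // string on success, false on failure
}
```

Call it with the exact UA string a real browser sends, e.g. `fetch_with_ua($url, 'Mozilla/5.0 ...')`.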
    Signature

    :)

  • tpw
    I have sent the precise user-agent string that appears in my site logs for my own computer.

    I have not run it through a proxy yet, because I have never looked into that before.


    p.s. You did get me thinking, though... I am looking at Google scraper code now.


    I have been going through scrapers, and I have the code, but they all require proxy URLs and ports... I am at a loss to find valid proxies that I can use in my coding.
    Signature
    Bill Platt, Oklahoma USA, PlattPublishing.com
    Publish Coloring Books for Profit (WSOTD 7-30-2015)
      • mojojuju
        Originally Posted by tpw

        I have been going through scrapers, and I have the code, but they all require proxy URLs and ports... I am at a loss to find valid proxies that I can use in my coding.
        I'm using Tor plus Polipo (an HTTP proxy) to provide me with a virtually unlimited supply of IP addresses. Although I haven't scripted anything yet, I'm using curl successfully from the command line like this:

        Code:
        curl -x 127.0.0.1:8118 -A 'Mozilla/4.05 [en] (X11; U; Linux 2.0.32 i586)' 'http://webcache.googleusercontent.com/search?q=cache:www.warriorforum.com'
        edit: Actually, don't use the search URL with "webcache.googleusercontent.com" above. To see if a page is cached, just do a regular Google search for the URL. Scrape the result to see if the URL is in the results. If it is, then scrape for the word "cache" and the link to the cached page to verify that the page is available in Google's cache.
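The scrape-the-results check just described can be sketched in PHP, assuming the Google results page has already been fetched into a string. The function name and the string-matching heuristics here are illustrative assumptions, not Google's API, and Google's markup can change at any time.

```php
<?php
// Sketch: given the HTML of a Google results page, return 1 if the
// page's URL appears in the results AND a "cache:" link is present,
// else 0 -- the 1-or-0 output the original poster asked for.
function cache_flag($html, $url) {
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === null || $host === false) {
        $host = $url; // a bare domain was passed in
    }
    // The URL must show up in the results at all...
    $hasUrl = stripos($html, $host) !== false;
    // ...and there must be a link to a cached copy.
    $hasCache = stripos($html, 'cache:' . $host) !== false
             || stripos($html, '/search?q=cache:') !== false;
    return ($hasUrl && $hasCache) ? 1 : 0;
}
```

Usage would pair this with whatever fetch routine you already have: `echo cache_flag($resultsHtml, 'http://www.warriorforum.com/');`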
  • tpw
    Thank you for your help.

    I was never able to pull the cached page or get the cache date, but using the technique you described, I was able to verify that Google had the URL, done with a scraping tool built on Snoopy.

    Thank you again.
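For reference, a Snoopy-based check along those lines might look like this sketch. Snoopy.class.php is a separate download, and the file path, function names, and UA string below are assumptions, not tpw's actual code.

```php
<?php
// Build the plain Google search URL for a page -- per the advice above,
// a regular search beats hitting webcache.googleusercontent.com.
function google_search_url($pageUrl) {
    return 'http://www.google.com/search?q=' . urlencode($pageUrl);
}

// Fetch the results page with Snoopy, which sends a browser-like UA.
// Snoopy.class.php is assumed to sit alongside this script.
function fetch_results_with_snoopy($pageUrl) {
    require_once 'Snoopy.class.php';
    $snoopy = new Snoopy();
    $snoopy->agent = 'Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20110101 Firefox/4.0';
    return $snoopy->fetch(google_search_url($pageUrl)) ? $snoopy->results : '';
}
```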
    • DrMIC
      Hi guys... if I could jump in here...

      I am looking to hit the web cache directly as well, and have tried a few different strategies, many of which you have mentioned.

      Am I reading correctly that the only way to do this successfully is via proxy masking, since the IP address will be blocked again and again?

      Is there a curl alternative that might be successful?

      Reply or PM

      Thank you
      -drMic
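On the curl-alternative question: one option in plain PHP is `file_get_contents()` with a stream context that sets a browser-like User-Agent header. This is a sketch only; whether Google accepts it (or still blocks the IP, as discussed above) is a separate question.

```php
<?php
// Sketch: fetch a URL without the cURL extension, sending a
// browser-like User-Agent via an HTTP stream context.
function fetch_without_curl($url, $ua) {
    $ctx = stream_context_create(array(
        'http' => array(
            'header'  => "User-Agent: $ua\r\n", // browser-like UA
            'timeout' => 10,                    // seconds
        ),
    ));
    // Returns the body as a string, or false on failure.
    return file_get_contents($url, false, $ctx);
}
```

This needs `allow_url_fopen` enabled in php.ini, which is the default on most hosts.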
