Verify Google Web Cache in PHP

by tpw
6 replies
Does anyone have any ideas on how I would go about retrieving the Google cache of a web page in PHP?

I don't need the actual content. I am simply looking to verify if a particular page has been cached by Google.

A simple 1 or 0 is my desired output.




p.s. The last time I tried to curl a Google web page, a few years ago, my attempt was blocked. I assume this remains the case.

p.p.s. I do have a Google API Key, but cannot see any reference to cache in their documentation.
#cache #google #php #web
  • mojojuju
    Did you set a user-agent string other than "Curl" when you downloaded from Google with cURL? If not, give it a shot. I'm pretty sure Google will flat-out deny anything with a UA string like "Curl" or "Wget".

    I was using a Google-scraping script a couple of months ago just fine, as long as I set a normal-looking user agent. I also ran it through a proxy connected to Tor.
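The user-agent fix above can be sketched in PHP's cURL extension. This is a minimal example, not anyone's production code; the function name and the idea of passing the UA in as a parameter are my assumptions.

```php
<?php
// Sketch: fetch a URL while sending a browser-like user agent,
// instead of cURL's default UA, which Google is likely to reject.
function fetch_with_ua($url, $ua) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $ua);       // browser-like UA string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
    $body = curl_exec($ch);
    curl_close($ch);
    return $body; // string on success, false on failure
}
```

Call it with the exact UA string a real browser sends, e.g. `fetch_with_ua($url, 'Mozilla/5.0 ...')`.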
    Signature

    :)

  • tpw
    I have sent the precise user-agent string that appears in my site logs for my own computer.

    I have not run it through a proxy yet, because I have never looked into that before.


    p.s. You did get me thinking, though... I am looking at Google scraper code now.


    I have been going through scrapers, and I have the code, but they all require proxy URLs and ports... I am at a loss to find valid proxies that I can use in my coding.
    Signature
    Bill Platt, Oklahoma USA, PlattPublishing.com
    Publish Coloring Books for Profit (WSOTD 7-30-2015)
      • mojojuju
        Originally Posted by tpw

        I have been going through scrapers, and I have the code, but they all require proxy URLs and ports... I am at a loss to find valid proxies that I can use in my coding.
        I'm using Tor plus Polipo (an HTTP proxy) to provide me with a virtually unlimited supply of IP addresses. Although I haven't scripted anything yet, I'm using curl successfully from the command line like this:

        Code:
        curl -x 127.0.0.1:8118 -A 'Mozilla/4.05 [en] (X11; U; Linux 2.0.32 i586)' 'http://webcache.googleusercontent.com/search?q=cache:www.warriorforum.com'
        edit: Actually, don't use the search URL with "webcache.googleusercontent.com" above. To see if a page is cached, just do a regular Google search for the URL. Scrape the result to see if the URL is in the results. If it is, then scrape for the word "cache" and the link to the cached page to verify that the page is available in Google's cache.
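The scrape-the-results check just described can be sketched in PHP, assuming the Google results page has already been fetched into a string. The function name and the string-matching heuristics here are illustrative assumptions, not Google's API, and Google's markup can change at any time.

```php
<?php
// Sketch: given the HTML of a Google results page, return 1 if the
// page's URL appears in the results AND a "cache:" link is present,
// else 0 -- the 1-or-0 output the original poster asked for.
function cache_flag($html, $url) {
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === null || $host === false) {
        $host = $url; // a bare domain was passed in
    }
    // The URL must show up in the results at all...
    $hasUrl = stripos($html, $host) !== false;
    // ...and there must be a link to a cached copy.
    $hasCache = stripos($html, 'cache:' . $host) !== false
             || stripos($html, '/search?q=cache:') !== false;
    return ($hasUrl && $hasCache) ? 1 : 0;
}
```

Usage would pair this with whatever fetch routine you already have: `echo cache_flag($resultsHtml, 'http://www.warriorforum.com/');`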
  • tpw
    Thank you for your help.

    I was never able to pull the cached page or get the cache date, but using the technique you described, I was able to verify that Google had the URL, done with a scraping tool built on Snoopy.

    Thank you again.
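For reference, a Snoopy-based check along those lines might look like this sketch. Snoopy.class.php is a separate download, and the file path, function names, and UA string below are assumptions, not tpw's actual code.

```php
<?php
// Build the plain Google search URL for a page -- per the advice above,
// a regular search beats hitting webcache.googleusercontent.com.
function google_search_url($pageUrl) {
    return 'http://www.google.com/search?q=' . urlencode($pageUrl);
}

// Fetch the results page with Snoopy, which sends a browser-like UA.
// Snoopy.class.php is assumed to sit alongside this script.
function fetch_results_with_snoopy($pageUrl) {
    require_once 'Snoopy.class.php';
    $snoopy = new Snoopy();
    $snoopy->agent = 'Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20110101 Firefox/4.0';
    return $snoopy->fetch(google_search_url($pageUrl)) ? $snoopy->results : '';
}
```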
    • DrMIC
      Hi guys... if I could jump in here...

      I am looking to hit the web cache directly as well, and have tried a few different strategies, many of which you have mentioned.

      Am I reading correctly that the only way to do this successfully is via proxy masking, since the IP address will be blocked again and again?

      Is there a curl alternative that might be successful?

      Reply or PM

      Thank you
      -drMic
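On the curl-alternative question: one option in plain PHP is `file_get_contents()` with a stream context that sets a browser-like User-Agent header. This is a sketch only; whether Google accepts it (or still blocks the IP, as discussed above) is a separate question.

```php
<?php
// Sketch: fetch a URL without the cURL extension, sending a
// browser-like User-Agent via an HTTP stream context.
function fetch_without_curl($url, $ua) {
    $ctx = stream_context_create(array(
        'http' => array(
            'header'  => "User-Agent: $ua\r\n", // browser-like UA
            'timeout' => 10,                    // seconds
        ),
    ));
    // Returns the body as a string, or false on failure.
    return file_get_contents($url, false, $ctx);
}
```

This needs `allow_url_fopen` enabled in php.ini, which is the default on most hosts.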
