Google Search Results Scraping bot

by chi124
32 replies
Hey guys,

I am in the process of creating a tool that scrapes Google for search rankings and the page found on Google. I read that this violates Google's guidelines.

I want to follow best practices on this, as I do not want to be on Google's bad side.

My question is...
What is the best way to have this tool do what I want without violating Google's guidelines?

How do these enterprise tools get away with scraping Google for rankings?
#bot #google #results #scraping #search
  • Profile picture of the author patadeperro
    Originally Posted by chi124 View Post

    Hey guys,

    I am in the process of creating a tool that scrapes Google for search rankings and the page found on Google. I read that this violates Google's guidelines.

    I want to follow best practices on this, as I do not want to be on Google's bad side.

    My question is...
    What is the best way to have this tool do what I want without violating Google's guidelines?

    How do these enterprise tools get away with scraping Google for rankings?
    Yes there is a way to do it:

    https://developers.google.com/custom...pi/v1/overview
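For reference, a minimal sketch of how a request to the Custom Search JSON API might be assembled and how result links could be read from its response. The endpoint and the `key`, `cx`, `q`, and `start` parameters follow Google's public documentation; the API key, engine ID, query, and sample response body are placeholders invented for illustration, and nothing here actually sends a request.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id, start=1):
    """Build a Custom Search JSON API request URL (no request is sent)."""
    params = {"key": api_key, "cx": engine_id, "q": query, "start": start}
    return f"{API_ENDPOINT}?{urlencode(params)}"


def result_links(response_text):
    """Return the result URLs, in order, from a JSON response body."""
    data = json.loads(response_text)
    return [item["link"] for item in data.get("items", [])]


# Hypothetical response body, trimmed to the fields used here.
SAMPLE_RESPONSE = json.dumps({
    "items": [
        {"title": "Example A", "link": "https://example.com/a"},
        {"title": "Example B", "link": "https://example.org/b"},
    ]
})

# Placeholder credentials: substitute your own key and engine ID.
url = build_search_url("blue widgets", "YOUR_API_KEY", "YOUR_ENGINE_ID")
links = result_links(SAMPLE_RESPONSE)
```

Whether the results returned this way match the live SERPs is a separate question.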
    • Profile picture of the author yukon
      Banned
      Originally Posted by patadeperro View Post

      Yes there is a way to do it:
      That's not even remotely close to being what OP is asking about (live SERPs).

      Google Custom Search is an embedded search for a single website's internal pages. It's basically a cached version (old data) of a site:domain.com search query.
      • Profile picture of the author patadeperro
        Originally Posted by yukon View Post

        That's not even remotely close to being what OP is asking about (live SERPs).

        Google Custom Search is an embedded search for a single website's internal pages. It's basically a cached version (old data) of a site:domain.com search query.
        Not really. In theory it is the replacement for this one:

        https://developers.google.com/web-search/docs/?hl=en

        That was the tool for getting information from the Google results. I have not used the new API, so I can't confirm it.

        Edit:

        I think you are right, Yukon: it is the search technology for delivering results within your own website.

        To answer OP's question: depending on what you want to do, I would make requests with some delay between them and disguise my bot as a web browser.
    • Profile picture of the author chi124
      I am sure Moz does not use the Google API for this.

      https://moz.com/blog/local-rankings-in-moz-analytics
      Nor does other enterprise SEO software like Searchmetrics or BrightEdge.

      How do they get around Google's TOS?
  • Profile picture of the author DevenderA
    I think that tool has already been developed. It's called ScrapeBox ..... lol
    • Profile picture of the author chi124
      Haha, yes, I know there are tools for this, but how do they avoid violating Google's TOS? The tool I am creating is just one piece of a much larger tool.
      • Profile picture of the author yukon
        Banned
        Originally Posted by chi124 View Post

        Haha, yes, I know there are tools for this, but how do they avoid violating Google's TOS? The tool I am creating is just one piece of a much larger tool.
        You're not scraping Google SERPs without violating their TOS.

        My question is, why do you care about a SERP TOS when Google scrapes your own websites all day long without permission? Look at your server log files: Google is eating up a large percentage of your bandwidth.
        • Profile picture of the author chi124
          Can't you add a robots.txt file that blocks Google's bots from crawling your website? So it is simply a choice?
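To make the opt-out concrete, here is a minimal sketch using Python's standard `urllib.robotparser` to check what a robots.txt file permits. The rules and the `example.com` URL are made up for illustration; a real file would live at the site root as `/robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Googlebot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() reports whether the named crawler may request the URL.
googlebot_allowed = parser.can_fetch("Googlebot", "https://example.com/page")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/page")
```

With these rules, `googlebot_allowed` is `False` and `other_allowed` is `True`.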
          • Profile picture of the author yukon
            Banned
            Originally Posted by chi124 View Post

            Can't you add a robots.txt file that blocks Google's bots from crawling your website? So it is simply a choice?
            Sure but that defeats the whole purpose of creating a scraper bot. If your own page isn't indexed, what's the point?
            • Profile picture of the author chi124
              I am trying to think of how I would answer Google if I violated their terms of service. Google would say that if you don't want them to scrape your website, you can add a robots.txt file.

              Yes, obviously we would lose out on performance.

              But Google is giving us a choice not to have them scrape our website, if we choose, by blocking Googlebot.

              Google has said they do not want scrapers, as scraping violates their terms of service.

              I guess this is a very gray area that really needs to be clarified.

              Rand Fishkin's response to Google's TOS (scroll down to the comments and look for the guy with the funny mustache):
              https://moz.com/community/q/seomoz-r...check-rankings

              Also funny response about Scraping.
              Google Trolls Itself in Attempt to End Website Scraping
              • Profile picture of the author MikeFriedman
                There is no way to get around their TOS. Anything scraping the results is against their TOS. End of story. If you are going to create something that checks ranks, you will have to do so violating their TOS, just like every other rank checker currently does.

                It's why RavenTools dropped their rank tracker when Google adjusted their TOS.

                That is all there is to it.
  • Profile picture of the author trevord92
    Google scrapes other sites, but it's against its terms of service for you to scrape Google. They'd prefer that the scraping go in only one direction, with them being the scraper.

    Read some of the (many) articles like this one that cover scraping.

    You'll need proxies, time delays, and something that deals with the pretty messy HTML that gets sent back because their computer is in a rush to deliver results.
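As an illustration of the time-delay part only, here is a minimal sketch that produces randomized wait times so requests are not fired at a fixed rhythm. The function name, the 5-second base, and the 2-second jitter are all assumptions chosen for the example, not recommended values.

```python
import random


def jittered_delays(count, base_seconds=5.0, jitter_seconds=2.0, seed=None):
    """Return `count` wait times, each base_seconds plus a random extra
    of up to jitter_seconds."""
    rng = random.Random(seed)
    return [base_seconds + rng.uniform(0, jitter_seconds) for _ in range(count)]


# Five wait times, each between 5 and 7 seconds.
delays = jittered_delays(5, seed=42)
```

In a real loop you would call `time.sleep(delay)` between requests.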
  • Profile picture of the author yukon
    Banned
    The SERP results are straightforward. If you view a Google search query as a text-only webpage, you'll see the results are simply an HTML bullet list. Granted, there's also a bunch of HTML that needs to be filtered out.

    My bet is Google doesn't like these DIY scrapers because they skew AdWords data (fake traffic) & eat up server bandwidth on visitors that will never click on an ad (fake traffic).

    Ditto on the proxies, that's the only way you won't get a temporary IP block, which is no big deal, just a hassle.
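The "results are a list, filter out the rest" idea can be sketched with Python's standard `html.parser`. The markup below is a made-up stand-in, not Google's actual HTML, and the class name `ListLinkParser` is hypothetical; the point is only to show keeping the first link of each list item and ignoring everything else.

```python
from html.parser import HTMLParser


class ListLinkParser(HTMLParser):
    """Collect the first link inside each <li> element, in page order."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_item = False
        self._item_has_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True
            self._item_has_link = False
        elif tag == "a" and self._in_item and not self._item_has_link:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                self._item_has_link = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False


# Hypothetical page: one navigation link plus two list-item results.
SAMPLE_HTML = """
<div id="nav"><a href="/settings">Settings</a></div>
<ol>
  <li><h3><a href="https://example.com/">Example</a></h3><a href="/cache1">Cached</a></li>
  <li><h3><a href="https://example.org/page">Other</a></h3></li>
</ol>
"""

parser = ListLinkParser()
parser.feed(SAMPLE_HTML)
ranked = list(enumerate(parser.links, start=1))
```

Here `ranked` pairs each result URL with its position; the navigation link and the "Cached" link are skipped.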
  • Profile picture of the author yukon
    Banned
    I don't think you'll get a straight answer because there's not a straight answer regarding SERP TOS. Google has the right to block a scraper bot just the same as a webmaster so I don't see how they can complain with the whole Do as I say, not as I do nonsense.

    Just curious, is this a single web based bot or are you distributing software?
    • Profile picture of the author chi124
      It will be browser-based. It will not be a software package that users download but a site they log in to in order to access this data.

      Like many of the SEO enterprise tools out there that I referred to above.
  • Profile picture of the author nmwf
    Originally Posted by chi124 View Post

    Hey guys,

    I am in the process of creating a tool that scrapes Google for search rankings and the page found on Google. I read that this violates Google's guidelines.

    I want to follow best practices on this, as I do not want to be on Google's bad side.
    What is the end-goal of all of this??
  • Profile picture of the author nmwf
    TIL: Scraping Google is against its TOS. Do you think the better-known spinners, re-writers, and scrapers know that? Cause.... well, that's a pretty significant issue!
    • Profile picture of the author MikeFriedman
      Originally Posted by nmwf View Post

      TIL: Scraping Google is against its TOS. Do you think the better-known spinners, re-writers, and scrapers know that? Cause.... well, that's a pretty significant issue!
      Of course they know. They just don't care. It's why when you use something like Scrapebox, you need proxies. When Google detects it, they block the IP being used.
  • Profile picture of the author trevord92
    Most bigger websites block scraping and automated attempts to get the info from their site.

    It's been happening for years.

    You won't know what they do or how long the ban is for until it happens. With Google it's fairly short (at least for a "first offence") and can be overcome by completing a CAPTCHA.

    On other sites it's longer.

    If your IP address rotates, like mine does, then you may even encounter a ban on a site when you've done nothing wrong but the previous user of the IP address has.
    • Profile picture of the author paulgl
      Google is not a law enforcing organization.

      You can violate their TOS on this all day long. So what?

      The silly thing is, scraping results is pointless in 2015.

      That's one reason why Google decided to hide 99% of this info:
      webmasters could not possibly decipher the madness
      to effectively use it. Why? Because there are TOO MANY
      intangibles when it comes to search.

      From location to time, from news to your browsing history.

      All is taken into account when one searches.

      Strip all that out, and what have you got? An almost
      insignificant amount of search info.

      Then you toss in mobile.

      Google is absolutely right on this one. There is no
      such thing as generic search in 2015 for 99.99% of
      REAL PEOPLE doing Google searches.

      Nothing to see here, move along.

      Paul
      • Profile picture of the author yukon
        Banned
        Originally Posted by paulgl View Post

        Google is not a law enforcing organization.

        You can violate their TOS on this all day long. So what?

        The silly thing is, scraping results is pointless in 2015.

        That's one reason why Google decided to hide 99% of this info:
        webmasters could not possibly decipher the madness
        to effectively use it. Why? Because there are TOO MANY
        intangibles when it comes to search.

        From location to time, from news to your browsing history.

        All is taken into account when one searches.

        Strip all that out, and what have you got? An almost
        insignificant amount of search info.

        Then you toss in mobile.

        Google is absolutely right on this one. There is no
        such thing as generic search in 2015 for 99.99% of
        REAL PEOPLE doing Google searches.

        Nothing to see here, move along.

        Paul
        That's all true & I get what you're saying, but if someone is trying to rank a page #1, #2, #3, etc. for traffic keywords, it's nice to know the current rank status without manually sifting through SERPs, especially if you're working on multiple keywords at the same time or even multiple keywords on multiple domains. All that manual data checking is time-consuming.

        I have ranked keywords that still show as ranked regardless of desktop or mobile, and that even show as ranked both in personalized results and with a clean cache/history on other folks' mobile devices.

        Anyways, nothing is perfect, but having an idea of where pages are ranked is useful.
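The bookkeeping for "multiple keywords on multiple domains" can be sketched as follows, assuming you already have an ordered list of result URLs per keyword from whatever source you use. The function name `rank_of`, the keywords, and all URLs are made up for illustration.

```python
from urllib.parse import urlparse


def rank_of(domain, result_urls):
    """Return the 1-based position of the first result on `domain`
    (or one of its subdomains), or None if it does not appear."""
    for position, url in enumerate(result_urls, start=1):
        host = urlparse(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            return position
    return None


# Hypothetical ordered results for three keywords.
results_by_keyword = {
    "blue widgets": ["https://other.com/x", "https://www.example.com/widgets"],
    "red widgets": ["https://example.com/red", "https://other.com/y"],
    "green widgets": ["https://other.com/z"],
}

report = {
    keyword: rank_of("example.com", urls)
    for keyword, urls in results_by_keyword.items()
}
```

A `None` entry in `report` means the domain was not found in the results supplied for that keyword.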
      • Profile picture of the author chi124
        Yes, I know the rankings can be inaccurate many times, but like Yukon said, they still have some pretty good use cases.

        Here is my approach to how I would use this tool, and the end goal.

        I look at my website as an answer and keywords as questions. Is my website answering those questions? This tool helps me scale up the tedious task of typing these into Google.

        To answer the end-goal question: I am in the process of building a tool that uses your Google Analytics and AdWords data to do effective keyword research. This helps you with SEO, PPC, social, email, and content marketing by surfacing really good topics to write about.

        That is the reason for the Google TOS question: if I create a tool for the masses, I do not want to violate any terms. If it were just for myself, then I couldn't care less if I got banned. I still haven't heard how these big enterprise SEO companies are getting away with it.



        Cheers
        • Profile picture of the author MikeFriedman
          Originally Posted by chi124 View Post

          That is the reason for the Google TOS question: if I create a tool for the masses, I do not want to violate any terms. If it were just for myself, then I couldn't care less if I got banned. I still haven't heard how these big enterprise SEO companies are getting away with it.

          They are not "getting away with it". They are violating Google's TOS. They just don't care. Violating the TOS does not carry any penalty.

          Like I said, it was why RavenTools decided to drop their rank tracker. There is no way to get around it. They were one of the few that did though. Most everyone else, including Moz, just kept doing what they were doing.
  • Profile picture of the author KHR
    There are many tools that scrape Google. As far as I know, Google has always hated scraping.

    If you use a good time delay, there should not be any problem.
    • Profile picture of the author nettiapina
      Originally Posted by chi124 View Post

      That is the reason for the Google TOS question: if I create a tool for the masses, I do not want to violate any terms. If it were just for myself, then I couldn't care less if I got banned. I still haven't heard how these big enterprise SEO companies are getting away with it.
      You know that the same people who offer desktop SERP trackers also often offer proxies, right? This means that they know very well that their product is technically against the TOS, and that Google is trying to throttle their clients. And as was already pointed out, they've decided to not care.

      "Enterprise SEO companies" just run a server or a bunch of VPSes that push the queries through a bunch of proxies. They're running the same kind of automation tools as everyone else.

      It's understandable that Google wants to limit queries, but they could be much harsher if they wanted to. I think that they simply see the value in SEO. This whole forum is essentially devoted to discussion of their flagship product, and there are hundreds more. The "Enterprise SEO companies" are selling Google to their clients. You can't really buy this kind of publicity. They're going to tolerate this parasitic industry even with all the spam and server capacity that goes to bots.

      Why TOS? Well, if some script kiddies step over the line they can smack them over the head with it.
  • Profile picture of the author patadeperro
    To me, the easiest way to spread your risk is to use the APIs of many of those SEO tools. Some of them allow you to make a certain number of calls per month for free. Here is a list of 45 SEO-related sites with APIs, instructions, endpoints, etc.:

    45 SEO APIs: Google AdSense, Alexa and Yahoo Site Explorer | ProgrammableWeb

    Even though the list is a little outdated, it can help you.

    Cheers.
    • Profile picture of the author yukon
      Banned
      Originally Posted by patadeperro View Post

      To me, the easiest way to spread your risk is to use the APIs of many of those SEO tools. Some of them allow you to make a certain number of calls per month for free. Here is a list of 45 SEO-related sites with APIs, instructions, endpoints, etc.:

      45 SEO APIs: Google AdSense, Alexa and Yahoo Site Explorer | ProgrammableWeb

      Even though the list is a little outdated, it can help you.

      Cheers.
      IMO APIs are horrible. They always break (eventually) & then buyers go complain to the 3rd-party software developer, who has zero control over the API data.

      A Google SERP is very basic. It's a webpage like any other webpage: just use some regex to parse the webpage HTML & filter down to the useful data. Scale up with proxies.

      Also, it's a horrible idea if a developer has to shell out money for an API that has data limits & that data is being used by the developer's buyers. Forget that mess, let the buyer buy proxies. That removes support headaches for the developer & puts the buyer in control of whether the software works or not (good proxies).
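A minimal sketch of the regex approach, run against a made-up HTML snippet rather than a real results page. The pattern assumes each result is a link wrapped in an `<h3>` heading, which is an illustrative assumption; real markup differs and changes over time, so any such pattern needs regular upkeep.

```python
import re

# Capture the href of a link that directly follows an opening <h3> tag.
RESULT_LINK = re.compile(r'<h3[^>]*>\s*<a[^>]*\bhref="([^"]+)"', re.IGNORECASE)

# Hypothetical page fragment: one ad link plus two result links.
SAMPLE_HTML = """
<div class="ad"><a href="https://ads.example.net/">Sponsored</a></div>
<li><h3 class="r"><a href="https://example.com/">Example</a></h3></li>
<li><h3 class="r"><a class="l" href="https://example.org/page">Other</a></h3></li>
"""

links = RESULT_LINK.findall(SAMPLE_HTML)
```

Here `links` holds the two result URLs in page order, and the ad link is ignored because it is not inside an `<h3>`.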
      • Profile picture of the author patadeperro
        Originally Posted by yukon View Post

        IMO APIs are horrible. They always break (eventually) & then buyers go complain to the 3rd-party software developer, who has zero control over the API data.

        A Google SERP is very basic. It's a webpage like any other webpage: just use some regex to parse the webpage HTML & filter down to the useful data. Scale up with proxies.

        Also, it's a horrible idea if a developer has to shell out money for an API that has data limits & that data is being used by the developer's buyers. Forget that mess, let the buyer buy proxies. That removes support headaches for the developer & puts the buyer in control of whether the software works or not (good proxies).
        I agree with you on certain parts, Yukon, but not on all. Here are my points:

        a) APIs are useful when they give you more data than you could get on your own. You can leverage somebody else's ideas and improve on them without starting from scratch.

        b) Yes, you can scrape Google with several tools (XPath or Perl modules). The issue here is your technical level and how you want to manipulate the data. If you want to integrate the ranks into a deeper analysis tool, why rack your brain doing it yourself when you can simply make a call to the API?

        c) Independently of API use, I think having your own proxies is important.
        • Profile picture of the author yukon
          Banned
          Originally Posted by patadeperro View Post

          I agree with you on certain parts, Yukon, but not on all. Here are my points:

          a) APIs are useful when they give you more data than you could get on your own. You can leverage somebody else's ideas and improve on them without starting from scratch.

          b) Yes, you can scrape Google with several tools (XPath or Perl modules). The issue here is your technical level and how you want to manipulate the data. If you want to integrate the ranks into a deeper analysis tool, why rack your brain doing it yourself when you can simply make a call to the API?

          c) Independently of API use, I think having your own proxies is important.


          a) I've already pointed out that APIs break. It's not if but when. Guaranteed to happen sooner or later (RIP Yahoo Site Explorer). Again, it's just a bad business plan for a developer to eat the cost of an API. Some APIs aren't free & have data caps.

          b) You're either an app developer or you're not. Regex is typically used by most developers building scrapers/bots.
          • Profile picture of the author patadeperro
            Originally Posted by yukon View Post

            a) I've already pointed out that APIs break. It's not if but when. Guaranteed to happen sooner or later (RIP Yahoo Site Explorer). Again, it's just a bad business plan for a developer to eat the cost of an API. Some APIs aren't free & have data caps.

            b) You're either an app developer or you're not. Regex is typically used by most developers building scrapers/bots.
            Yes, the API can break, just like your entire application can break when the service itself disappears. That is an inherent risk on the net (if the Google search engine disappears, your application will be done, API or not).

            The point I think OP should weigh is not just his experience level as a developer, but what exactly he wants to create. If you want to make an application that geographically locates the sites where videos were taken, why wouldn't you use the YouTube API and the Google Maps API (or openmaps, or Bing)?

            To me, APIs give non-expert developers an opportunity to draft their ideas and, if they work well, expand them further (with or without the API).

            Anyway, OP needs to evaluate exactly what he wants to do, what his development experience is, and how he wants to scale his idea.