4 replies
I have written a basic Amazon scraping program. I scan a product's UPC with a bar code scanner, and the program retrieves data about that product. The program runs no scripts; it simply pulls HTML data.

The routine consists of six GET requests and downloads one image. Generally the program is not run more than three times per minute.

  • I have no time pauses between GET requests
  • I am not rotating user-agents between GET requests (can easily do this if needed)
  • I am not using any proxies
I do not believe I am putting a strain on Amazon servers. But I am wondering if I am at risk of an IP ban. Will Amazon take action against my puny little program?

Before I added the user agent to my requests, they would occasionally get blocked by Amazon with a message telling me to use their API for automated access.

I have been using the program for over a year, but I am concerned that I am pushing my luck. I cannot risk losing my seller account. Before using proxies or a VPN, I am wondering whether that is really necessary in my case.
#concerns #scraping
  • Profile picture of the author KirkMcD
    And why don't you want to use the Amazon API?
    • Profile picture of the author The Dead Guy
      Originally Posted by KirkMcD

      And why don't you want to use the Amazon API?
      The API limits the pricing data it returns. The data I am collecting is public data that I am using for my own personal offline use.
      Look! No links in my signature. This means I actually have something useful to say. I'm not just posting to be posting...
      • Profile picture of the author AboutTown
        For those 6 pages you scrape, do they follow a process that could be carried out naturally by following links in a web browser? If so, I would suggest adding a small delay between the requests to simulate the time it would take a user to click those links. Keep the user agent the same but set the Referer header on each request, i.e., when visiting page 2, set the referrer to page 1.
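As a rough sketch of the referrer suggestion above, the headers for each request could be built up front. The `build_header_chain` helper, the `USER_AGENT` value, and the example.com URLs are all hypothetical placeholders, not anything from the original program.

```python
from typing import Dict, List, Optional

# Hypothetical browser-like User-Agent; substitute whatever the program already sends.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_header_chain(urls: List[str], user_agent: str = USER_AGENT) -> List[Dict[str, str]]:
    """Return one header dict per URL, with Referer pointing at the previous URL."""
    chain: List[Dict[str, str]] = []
    previous: Optional[str] = None
    for url in urls:
        headers = {"User-Agent": user_agent}  # same user agent on every request
        if previous is not None:
            headers["Referer"] = previous  # the HTTP header name is spelled "Referer"
        chain.append(headers)
        previous = url
    return chain


# Placeholder URLs standing in for the pages fetched in sequence.
pages = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]
header_chain = build_header_chain(pages)
```

Each dict can then be handed to whatever HTTP client the program uses, for example as the `headers` argument of `urllib.request.Request`. The first request carries no Referer, which matches a user typing or pasting the first address.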

        I suspect the volume of hits you are talking about wouldn't get noticed by Amazon. The network traffic analysis required to pick this up would be too much of an overhead for their deployment team.

        I've not hit Amazon with scrapers, but I have hit eBay with a script that makes over 700 requests a minute without ever having to change IPs.
        • Profile picture of the author synapticmishap
          Dead Guy

          I'd echo others' responses - if you're concerned about getting a ban, introduce a reasonable delay between page GETs, as though a real user were clicking through the content. You can also set the user agent for the requests to a real one - Chrome or Safari etc. - to make it look like a real user.

          Another idea is to vary the delay between page views randomly, anywhere from 1 to 5 seconds, using a script. No user spends exactly 1 second looking at every page!
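A minimal sketch of the random-delay idea, assuming the program already has some function that performs a single GET. The `fetch_with_random_delays` name and its `fetch`, `sleep`, and `rng` parameters are hypothetical and exist only for illustration.

```python
import random
import time
from typing import Callable, Iterable, List


def fetch_with_random_delays(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 1.0,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[float, float], float] = random.uniform,
) -> List[str]:
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results: List[str] = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(rng(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

Here `fetch` is whatever function already does the GET and returns the HTML. With six GETs per scan there are five pauses, so this adds roughly 5 to 25 seconds to each run.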

          Also, if you're that concerned, it might be worth just ponying up the money and paying for the API access...

