Scraping Concerns

by The Dead Guy

Posted: 11 years ago 4 replies

PROGRAMMING

I have written a basic Amazon scraping program. I scan a product's UPC with a bar code scanner, and the program retrieves data about that product. The program runs no scripts; it simply pulls HTML data.

The routine consists of six GET requests and downloads one image. Generally the program is not run more than three times per minute.

I have no time pauses between GET requests
I am not rotating user-agents between GET requests (can easily do this if needed)
I am not using any proxies

I do not believe I am putting a strain on Amazon servers. But I am wondering if I am at risk of an IP ban. Will Amazon take action against my puny little program?

Before I added the user agent to my requests, they would occasionally get blocked by Amazon with a message telling me to use their API for automated access.

I have been using the program for over a year, but I am concerned that I am pushing my luck. I cannot risk losing my seller account. Before I start using proxies or a VPN, I am wondering whether that is really necessary in my case.

#concerns #scraping

KirkMcD 11 years ago

And why don't you want to use the Amazon API?
- The Dead Guy 11 years ago
  
  Originally Posted by KirkMcD
  
  And why don't you want to use the Amazon API?
  
  The API limits the pricing data it returns. The data I am collecting is public, and I am using it only for my own personal offline use.
  
  Signature
  Look! No links in my signature. This means I actually have something useful to say. I'm not just posting to be posting...
  
  AboutTown 11 years ago
  
  Do the six pages you scrape follow a path that a person could take naturally by clicking links in a web browser? If so, I would suggest adding a small delay between the requests to simulate the time it would take a user to click those links. Keep the user agent the same but set the Referer header on each request, i.e., when visiting page 2, set the referrer to page 1.
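  A minimal Python sketch of that suggestion, assuming the standard-library urllib is used for the requests. The user-agent string, the function names, and the two-second default delay are placeholders made up for illustration; none of them come from the original program.

```python
import time
import urllib.request

# Hypothetical user-agent string, for illustration only.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_requests(urls, user_agent=USER_AGENT):
    """Build one Request per URL; each carries the previous URL as its Referer."""
    requests = []
    previous = None
    for url in urls:
        headers = {"User-Agent": user_agent}  # same user agent on every request
        if previous is not None:
            headers["Referer"] = previous  # page N points back at page N-1
        requests.append(urllib.request.Request(url, headers=headers))
        previous = url
    return requests


def _http_get(request):
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()


def fetch_all(urls, delay_seconds=2.0, fetch=_http_get, sleep=time.sleep):
    """Fetch the URLs in order, pausing between consecutive requests."""
    bodies = []
    for index, request in enumerate(build_requests(urls)):
        if index > 0:
            sleep(delay_seconds)  # no pause before the very first page
        bodies.append(fetch(request))
    return bodies
```

  The `fetch` and `sleep` parameters are only there so the loop can be tried without touching the network; calling `fetch_all(list_of_urls)` with the defaults sends real requests.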
  
  I suspect the volume of hits you are talking about wouldn't get noticed by Amazon. The network traffic analysis required to pick this up would be too much of an overhead for their deployment team.
  
  I've not hit Amazon with scrapers, but I have hit eBay with a script that makes over 700 requests a minute without ever having to change IPs.
  
  synapticmishap 11 years ago
  
  Dead Guy
  
  I'd echo the others' responses: if you're concerned about getting a ban, introduce a reasonable delay between page GETs, as though a real user were clicking through the content. You can also set the user agent for the requests to a real one (Chrome, Safari, etc.) so the traffic looks like it comes from a real user.
  
  Another idea is to have the script vary the delay between page views randomly, say between 1 and 5 seconds. No user spends exactly 1 second looking at every page!
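  A rough sketch of the random delay in Python, assuming the 1 to 5 second range mentioned above. The function names are invented for this example, and the actual page requests are left out.

```python
import random
import time


def human_delay(min_seconds=1.0, max_seconds=5.0, rng=random):
    """Pick a wait time uniformly between the two bounds."""
    return rng.uniform(min_seconds, max_seconds)


def pause_between_pages(page_count, sleep=time.sleep, rng=random):
    """Sleep a random interval between consecutive pages; return the waits used."""
    waits = []
    for _ in range(page_count - 1):  # one pause between each pair of pages
        wait = human_delay(rng=rng)
        sleep(wait)
        waits.append(wait)
    return waits
```

  For the six-request routine described in the first post, `pause_between_pages(6)` would sleep five times, adding somewhere between 5 and 25 seconds to each scan.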
  
  Also, if you're that concerned, it might be worth just ponying up the money and paying for the API access...
  
  John
  