PHP, cURL, and grabbing a ReCAPTCHA image

6 replies
I have used and liked Social Bookmarks Demon for a while now, but it has been broken and I've waited 2 months for them to fix it. Still no fix... So I'm trying to rebuild it myself, because I want a PHP/web-based bookmarking tool, not a Windows/Mac-based one.

Currently Social Bookmarks Demon will register accounts for you with a bunch of sites and then tell you to manually create accounts at several sites with CAPTCHAs. I have found the manually created accounts do the most good, but it just isn't practical to fill in all the forms, or even use FireForms, when you want to create hundreds of accounts at each of these sites.

So my plan is to have my new tool fill in the form for me. Except I haven't come up with a CAPTCHA breaker yet so I need the tool to return the CAPTCHAs for me (or someone else) to answer before submitting the form to the site.

I know the basics of submitting a form using PHP and cURL. I've done it successfully on a few sites. It's not tough. However, many of the best bookmarking sites have CAPTCHA on them (during account registration). For your simple CAPTCHA there's an img tag that you can scrape and grab the image. No problem.

Then you get to the sites with ReCAPTCHA, like bibsonomy, and grabbing those images is much harder. Firebug is giving me some hints on how to get the image from Google (who owns ReCAPTCHA for those who don't know).

Anyone tackle this problem before? Any hints?

Thanks
#curl #grabbing #image #php #recaptcha
  • caesargus
    I'd be interested in this solution as well, since I'm working on a similar project for work (the ReCAPTCHA images, not the social bookmarking tool).
  • SteveJohnson
    One of the features of ReCAPTCHA is that it tracks where the form submission came from. If it's not the same as the address that requested the image, the submission is automatically rejected.

    And that's just ONE of the anti-spammer provisions it utilizes.

    Good luck bypassing it...
  • spradlig
    Yeah, I'd noticed in Firebug that there was a cookie from bibsonomy that appeared to be required. Also, the Referer header was set. The cookie looked to be set on my machine by bibsonomy, so I have it, and the referer can be set manually.

    However, from Firefox I can right-click and save the image to a file on my hard drive. If there were a way to display this local image in my own form, where I'm solving many CAPTCHAs at once, then I'm thinking the submission of the bibsonomy form should be fine so long as my IP hasn't changed during the process.

    My intention is to grab and submit the form on bibsonomy (and other sites) using cURL. However, I'd like to display the ReCAPTCHA on my own form (along with many others) so I can fill out a whole bunch and hit submit once. That way the submissions can be multithreaded and shouldn't take long. Effectively I want to use cURL to do what Fireform already does but Fireform would be slow by comparison.
  • spradlig
    @SteveJohnson
    I'm thinking that some of these site-matching protections and such are the real spam protection. Many people have made the argument that you can't break ReCAPTCHA because Google, with all its dollars, couldn't get its Optical Character Recognition (OCR) programs to recognize 1 of the 2 words, so why do you think you can? But the logic is flawed. ReCAPTCHA gives you 2 words: one they know and one they don't. The only word you have to answer correctly is the one they have an answer for. And they have an answer because their OCR has already figured out that word. As a result, you know that OCR can in fact figure out the 1 word you have to answer correctly.

    So OCR isn't the problem, nor is it what I'm trying to tackle at the moment. The problem is all these other hurdles, like the site matching. I'm beginning to wonder if the only way around this is to create an iframe for ReCAPTCHAs on the fly...
  • levinh
    I think you are talking about a system like this: neobookmark [dot] com
    You can have a look; it can solve ReCAPTCHA with the Decaptcher service.
  • khay
    It can be done but, as mentioned, the request (and response) must be made by the same machine, in the same cookie session.

    With ReCAPTCHA you need to:

    1. Request the JavaScript file that loads the captcha image
    2. Parse out the URL of the image
    3. Download the image

    4. Once you have the image, do as you wish - I've scripted sending it off to something like Decaptcher*.

    5. Send the text back to ReCAPTCHA and check their response

    *If you're using Decaptcher (I think it was these guys, the ones with an API), make sure that if ReCAPTCHA says your response was wrong, you make a call back to Decaptcher's API to tell them it was wrong. 20-30% of the time they'll get it wrong, and if you report it to them as incorrect it won't burn up your credits quite so fast.
