Programmer recommendation for 'scraping' type software

16 replies
I am in the preliminary research phase of having software created that is a 'scraper' of sorts.

Without going into too much detail at this point: we would enter certain data, like a keyword or a domain name, and the software would search certain predetermined sites for results and scrape specific data from those sites.

Once the data is retrieved, it would put everything into a nicely formatted report (PDF/DOC/XLS) and store the search in an online database for future reference.

If someone could explain the pros and cons of the types of systems/platforms that would work best in this situation, it would help me find the right person to build this beast.

For example, I am leaning towards having this be a web/server-hosted platform on the back end of our website, where we can log in and generate the reports. But maybe a desktop version would be better? (The Adobe AIR platform, or something else?)

Any preliminary tips to help point me in the right direction would help a great deal. (And any top-notch programmers you know would be nice too!!) I don't really know all the programmer lingo to put my help-wanted ad together yet.

Thanks!
#programmer #recommendation #scraper #scraping #software #type
  • SebastianJ
    I do a lot of scraping and I fanatically use the combination of Ruby on Rails and a Ruby parsing library called Nokogiri. I've done a fair bit of scraping in PHP and Java, and it just doesn't compare to Rails + Nokogiri.

    A desktop version would only be better if you need native multi-threading support (that is, fetching a lot of websites in parallel). This can still be done with, for example, JRuby, but if you really want to create a desktop app I'd suggest you build it in C#/.NET (Windows only).

    All in all, if I were you I would go down the web app path and, first of all, use Rails + Nokogiri, or alternatively PHP with a parsing library, e.g. PHP Simple HTML DOM Parser.
  • bettor
    I am using PHP and the cURL library in combination with XPath. It's been fantastic; I strongly recommend it for scraping.
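
    For illustration, here's a minimal sketch of that combination. It's not from a real project; the URL and the XPath query are just placeholders:

    Code:
    <?php
    // Fetch a page with cURL (the URL is a placeholder).
    $ch = curl_init('http://example.com');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    $html = curl_exec($ch);
    curl_close($ch);

    // Parse the (often sloppy) HTML and query it with XPath.
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);  // silence warnings from malformed markup
    $doc->loadHTML($html);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a/@href') as $href) {
        echo $href->value, "\n";
    }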
  • VegasGreg
    Added information:

    We wouldn't be doing "mass" scraping of thousands of items; it would be maybe 10-50 data pieces from 4-10 sources for each entry. Not sure if that matters for the route we go, but I think it would require less server load than some of the mass scrapers I have seen.
  • Big Squid
    If you're not scraping too much data, PHP would be fine. And if you want to keep it locked down and offer it as a membership product, PHP works well for that.

    However, PHP cannot multi-thread, meaning it only handles the code line by line. C# would allow you to retrieve your data much more quickly. If you think this project may expand into a greater amount of scraping, I'd develop it in C#.
    • unnatural
      PHP can't multi-thread natively, but you can easily run more than one request at a time if you set it up properly, so IMO it works just fine.
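
      As a sketch of one common way to set that up: PHP's curl_multi API drives several cURL handles concurrently in a single process, which is usually what people mean by "parallel" PHP scraping. The URLs here are placeholders:

      Code:
      <?php
      // Placeholder URLs; swap in the pages you actually need.
      $urls = array('http://example.com/a', 'http://example.com/b', 'http://example.com/c');

      $mh = curl_multi_init();
      $handles = array();
      foreach ($urls as $url) {
          $ch = curl_init($url);
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
          curl_multi_add_handle($mh, $ch);
          $handles[$url] = $ch;
      }

      // Drive all transfers concurrently until every one has finished.
      do {
          curl_multi_exec($mh, $running);
          curl_multi_select($mh);  // wait for network activity instead of busy-looping
      } while ($running > 0);

      foreach ($handles as $url => $ch) {
          echo $url, ': ', strlen(curl_multi_getcontent($ch)), " bytes\n";
          curl_multi_remove_handle($mh, $ch);
          curl_close($ch);
      }
      curl_multi_close($mh);

      Strictly speaking that's concurrent I/O rather than true threads, but for fetching pages the network wait is usually the part worth parallelizing anyway.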
  • zenyatta
    Originally Posted by VegasGreg View Post

    Hi Greg,

    No reason to reinvent the wheel when there is a great product already out there that does everything you mentioned: searching keywords and domains on everything from Google, Google Maps, Google Places, Bing, Yahoo, three different Yellow Pages directories, Craigslist, and a whole lot more.

    It shows all the ranking info, header titles, keywords, meta tags, whether the site has Facebook and Twitter accounts, whether they have claimed their Google Places listing, and again so much more. It outputs everything in a very easy-to-use format and even has an emailing feature if you want to contact all the email addresses it has drilled down to find. It is multi-threaded, so the speed with which it can gather all this info is quite incredible.

    I have used it for 6 months and I swear by it. I can't count the hundreds of hours it has saved me. It is the BEST VA I have ever hired. LOL.
    You can see it in action here: http://bit.ly/rqcWjV

    Good Luck,
    Zenyatta
    • VegasGreg
      Originally Posted by zenyatta View Post

      Thanks. There are a lot of those types of software out there, and I only need part of that. The extra parts are not readily available (that I have found yet). I only mentioned part of what I need the whole thing to do, as a general overview.
  • lovenot
    Use C# for the back end for multi-threading, speed, proxies, and such, and PHP for your web front end...
  • TigerNone
    I agree with what has been said here: if you are scraping a large volume of pages, you need something multi-threaded. That will most likely be written in C. Fortunately, this is a pretty common problem, so there are lots of open source scrapers out there.

    If you want to roll your own, I think Perl is a good choice because of its excellent regular expressions. There are also Perl modules like WWW::Mechanize that will make the scraping much easier. Alternatively, there is a headless WebKit tool called PhantomJS that you drive with JavaScript. It can scrape a page after its JavaScript has run, so if there is any AJAX-loaded content on the page, it will grab it.

    Another important thing to remember is that while multi-threading is great, it's likely that the limiting factor will be your internet connection, not how fast the scraper can run on your box.
  • lconsult
    If you just want to scrape keywords from a website, try the trial version (look for the red "Download Now" box) at www.PPCKeywordToolz.com. Select Tools->Keyword Tools->Scrape Keywords.

    The trial version will scrape one domain at a time.

    It won't take the place of one of the programming combinations already mentioned; however, for quick keyword scraping, you can't beat it!
  • kamirao
    It's a matter of which programming language you are comfortable with:

    For Perl programmers, the WWW::Mechanize and HTML::TreeBuilder modules are a good option
    For PHP, use cURL (for crawling) and XPath (for parsing)
    For Python, use urllib2 and BeautifulSoup (or the Scrapy framework)
  • NoFluff
    Depending on how limited you are in time/technical abilities, a browser-based scraper might work well for you too.

    I use something called Web Content Extractor, and the only thing I hate is that it uses IE (which is slow as a snail for me) as the browser. But once you get the hang of it, you can easily browse to a page, define a format for the data you want (title, specific link, etc.) and get the results back in a CSV. I use it for keywords, building site databases, and getting link info, and while CSV works best for me, it's a generic enough format that you can use it across several databases and scripting languages.
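
    As a rough sketch of that last point, here's one way such a CSV export could be consumed from PHP; the file name and column names are hypothetical:

    Code:
    <?php
    // "results.csv" and its columns are made up for illustration.
    $fh = fopen('results.csv', 'r');
    $header = fgetcsv($fh);  // first row: column names, e.g. title,link

    while (($row = fgetcsv($fh)) !== false) {
        $record = array_combine($header, $row);  // column name => value
        echo $record['title'], ' -> ', $record['link'], "\n";
    }
    fclose($fh);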
    • Thrasher66099
      I'd suggest the .NET framework. Most people here are suggesting C#, and I believe that's a great option, but if you want it created as quickly as possible I'd suggest VB.NET. You'll have a little less flexibility with VB.NET, but because all .NET languages compile to the same intermediate code, you don't actually lose any real functionality or efficiency. Another great thing about VB.NET is all of the syntactic sugar; the With keyword alone makes VB.NET truly powerful for building applications quickly.
  • Bofu2U
    +1 for Ruby and Nokogiri. However, keep in mind I said Ruby, not Rails. Rails + Noko would do more harm than good: you lose RAM and CPU power by loading all of Rails.
    • SebastianJ
      Originally Posted by Bofu2U View Post

      The reason I proposed Rails earlier is that I pretty much always use Rails for every Ruby project I do (since I pretty much only build web apps). If he's not going to build a front end that displays the stuff he's scraped, then yeah, he should just stick to Nokogiri + Ruby, or preferably, in this case, JRuby.
      • Bofu2U
        Originally Posted by SebastianJ View Post

        Yeah. The best hybrid solution I mess with (RAM use/quickness aside) is a Rails front end piping to delayed_job for the back-end processing. Definitely not as fast as my 500+ thread apps, but it gets the job done.