What language for search engine scraping? Experience with .NET?

by heslil
8 replies
A while ago, I started writing a search engine scraper in C# (.NET) and had to resolve many issues before I had a quality tool. What languages/frameworks have you used to write search engine scrapers? Which one do you think is the most suitable?

Before picking C# I tried doing it in Python, but on Windows it's not as easy as on Linux. Now my tool is ready, but if I were to write it from scratch again, I might use C/C++ (native code). I think ScrapeBox and other similar tools are written in C/C++.
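
For anyone new to the topic: whichever language you pick, the core step of a scraper like this is pulling result links out of a fetched page. Below is a minimal sketch in Python using only the standard library. The sample HTML, the `result` class name, the `search.example` URL and the helper names are all made up for illustration; real result pages use different markup and change often, so treat this as a starting point rather than how any particular tool works.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResultLinkParser(HTMLParser):
    """Collects href values from <a> tags that carry a given CSS class."""

    def __init__(self, base_url, link_class):
        super().__init__()
        self.base_url = base_url
        self.link_class = link_class  # hypothetical class marking a result link
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.link_class in classes:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_result_links(html, base_url, link_class="result"):
    """Return absolute URLs of all anchors tagged with link_class."""
    parser = ResultLinkParser(base_url, link_class)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page fragment standing in for a downloaded results page.
SAMPLE_HTML = """
<div><a class="result" href="/page-one">One</a></div>
<div><a class="ad" href="https://ads.example/x">Ad</a></div>
<div><a class="result title" href="https://example.org/two">Two</a></div>
"""

if __name__ == "__main__":
    for url in extract_result_links(SAMPLE_HTML, "https://search.example/"):
        print(url)
```

This only handles static HTML. Pages that build their results with JavaScript need a real or headless browser instead.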

Just for fun, here is a screenshot of my scraper (the trial version will be released soon):
  • jimjones
    Hi heslil,

    I am currently working on a crawler project too. In my opinion, the best free open-source web crawler in C# is Abot; I am using it myself.
    https://code.google.com/p/abot/

    If you need help, let me know.
    • heslil
      Thank you, jimjones! Abot seems really interesting, I will check the code later to see what I can learn from it.
      • David B
        If you know any Java, you may want to take a look at Selenium (Selenium - Web Browser Automation). It is an open-source project designed for browser testing: its Java libraries connect to your browser and drive its execution. There is support for Chrome, IE, Firefox, etc., which means you can write Java code that connects to your browser through a driver and then executes anything the browser can, including JavaScript, jQuery, etc.

        We have used it for jQuery testing, where you want to see how different browsers will actually execute your JavaScript.
        • heslil
          David B, Selenium is really cool, thanks for the tip!
        • heslil
          jminkler, that's gold! Thank you!
  • jminkler
    Scrapy + Portia (Python) are the de facto standard. I have also written scrapers in JS, using CasperJS as a framework of sorts.
  • jimjones
    Another great tool is PhantomJS, a headless browser. It is useful if you need to crawl JavaScript- or AJAX-enabled sites.
  • heslil
    PhantomJS seems really cool too. jimjones, thank you!

    I will compile a list soon and share it here. If you know of any other handy tools, please share them with us.
