Web scraping for business names

by chi124
10 replies
Hi fellow Warriors,

Are there any good tutorials or classes that would help me understand the concepts of web scraping? My Google searches are not getting me what I want, or I may just be intellectually challenged when it comes to coding.

I want to learn to build web scrapers for all kinds of things, but mainly I am given a bunch of URLs and addresses and want a way to scrape the URLs, or look up the addresses online, to find out the actual business name.

Any help on this matter is appreciated.
#business #names #scraping #web
  • Profile picture of the author NobleSavage
    Generally you just use a regex library. If you are using PHP, look into cURL and regex.

    There are some scraping plugins for Chrome that might help you get a better idea of how to go about it.

    What language are you using? I may be able to give you more tips.
    • Profile picture of the author kenmichaels
      Originally Posted by NobleSavage View Post

      Generally you just use a regex library. If you are using PHP, look into cURL and regex.

      There are some scraping plugins for Chrome that might help you get a better idea of how to go about it.

      What language are you using? I may be able to give you more tips.
      Regex is fairly resource intensive. It's probably easier for a newbie to just parse tags.

      OP, here is a rather simple regex pattern:

      Code:
       .Pattern = "(http|ftp|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?"
      If you don't understand it, then I suggest not bothering with regex.
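
For anyone who wants to try the pattern from this post outside VBA, here is a minimal Python sketch. It assumes the intended pattern uses `\w` and `\.` with escaped hyphens, and switches the groups to non-capturing so that `findall` returns whole URLs. The names `URL_PATTERN`, `find_urls` and `sample` are made up for illustration.

```python
import re

# Assumed reconstruction of the pattern in this post: \w and \. written out,
# hyphens escaped, groups made non-capturing so findall returns full matches.
URL_PATTERN = re.compile(
    r"(?:http|ftp|https)://[\w\-]+(?:\.[\w\-]+)+"
    r"(?:[\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?"
)


def find_urls(text):
    """Return every URL-looking substring of text, in order of appearance."""
    return URL_PATTERN.findall(text)


# Hypothetical input; the trailing period is not part of the second URL.
sample = "Visit http://example.com/contact or https://shop.example.org/?id=7."
print(find_urls(sample))
```

The last character class leaves out `.` and `,`, which is why sentence punctuation right after a URL is not swallowed into the match.
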
      • Profile picture of the author NobleSavage
        Originally Posted by kenmichaels View Post

        Regex is fairly resource intensive. It's probably easier for a newbie to just parse tags.

        OP, here is a rather simple regex pattern:

        Code:
         .Pattern = "(http|ftp|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?"
        If you don't understand it, then I suggest not bothering with regex.
        Regex isn't so bad with a cheat sheet. Anyhow, you are correct, it would probably be easier to parse the DOM. You need to find a good DOM parser though, one that copes with malformed HTML. PHP Simple HTML DOM Parser always chokes on me with anything complex.

        To the OP: if you don't want to learn programming, just hire someone to do it. You can get web scraping done cheaply.
        • Profile picture of the author hassansin
          With PHP, I use DOMDocument, XPath and cURL. I have never faced any issues parsing the DOM so far. All of these are built into PHP, so I have never had to use other libraries.

          If you are into Node.js, I suggest learning the request and cheerio modules.
          • Profile picture of the author chi124
            Thanks for all the replies. I was looking into Python or VBA if possible, but it looks like it will be damn near impossible in Python, as the HTML structure will be different for each site.
            • Profile picture of the author PhilHardaker
              Scrapy is a popular Python scraping tool (scrapy.org), more modern than mechanize.

              But yes, every URL will need different scraping logic unless the page layouts are identical. Also consider that once you have your scraping program working, there is no guarantee that the website won't change its layout tomorrow. (A tip: if the site has a mobile layout, that will probably be easier to scrape.)
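
Some pieces of a page do sit in the same place on most sites, such as the `<title>` tag and, where present, the `og:site_name` meta tag. Below is a minimal Python sketch that reads those two with the standard library's `html.parser`. It assumes the business name is in one of them, and that a title like "Page | Business" puts the name last. Both are guesses that will be wrong for some sites, and `NameHints` and `guess_business_name` are made-up names.

```python
from html.parser import HTMLParser


class NameHints(HTMLParser):
    """Collect the <title> text and the og:site_name meta tag, if present."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.site_name = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("property") == "og:site_name":
            self.site_name = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def guess_business_name(html):
    """Return a best-guess business name from page markup, or None."""
    parser = NameHints()
    parser.feed(html)
    parser.close()
    if parser.site_name:
        return parser.site_name.strip()
    # Assumption: titles shaped like "Page | Business" put the name last.
    title = parser.title.strip()
    for sep in (" | ", " - "):
        if sep in title:
            return title.split(sep)[-1].strip()
    return title or None


# Hypothetical page with an unclosed <p>, to show sloppy markup is tolerated.
page = "<html><head><title>Contact Us | Acme Plumbing</title></head><body><p>unclosed"
print(guess_business_name(page))
```

The result is only a candidate name, so it is worth spot-checking a sample of the output by hand.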

              I agree with NobleSavage though, I would outsource it. That is a common low-cost job on oDesk.
  • Profile picture of the author KirkMcD
    Originally Posted by chi124 View Post

    to help me find out the actual business name.
    Unless the pages are identical, that's not gonna happen.
    • Profile picture of the author NobleSavage
      Originally Posted by KirkMcD View Post

      Unless the pages are identical, that's not gonna happen.
      That's why you scrape YellowPages.com
  • Profile picture of the author KirkMcD
    That's great, but he said he has a list of URLs that he wants to check.
    • Profile picture of the author NobleSavage
      Originally Posted by KirkMcD View Post

      That's great, but he said he has a list of URLs that he wants to check.
      Scraping the entire YellowPages database and then comparing it with the list of URLs is the best way I can think of to do what the OP wants.
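
The comparison step could look something like the Python sketch below. It assumes the directory data has already been collected into rows with a name and a website, and it matches on the bare domain. The `directory` rows, the `.example` domains and the function names are all made up for illustration.

```python
from urllib.parse import urlparse


def domain_key(url):
    """Reduce a URL to a lowercase host without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


# Hypothetical rows standing in for previously collected directory listings.
directory = [
    {"name": "Acme Plumbing", "website": "http://www.acmeplumbing.example/"},
    {"name": "Bayside Dental", "website": "https://baysidedental.example"},
]


def match_names(urls, listings):
    """Map each input URL to a listing name with the same domain, else None."""
    by_domain = {domain_key(row["website"]): row["name"] for row in listings}
    return {url: by_domain.get(domain_key(url)) for url in urls}


urls = ["https://acmeplumbing.example/contact", "www.unknown.example"]
print(match_names(urls, directory))
```

URLs that come back as `None` are the ones the directory does not cover, so they would need a different lookup.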