Web scraping for business names

by chi124

Posted: 12 years ago 10 replies

PROGRAMMING

Hi fellow Warriors,

Are there any good tutorials or classes that would help me understand the concepts of web scraping? My Google searches are not getting me what I want, or I may just be intellectually challenged when it comes to coding.

I want to learn to build web scrapers for all kinds of things, but my main task is this: I am given a bunch of URLs and addresses, and I want a way to look up each URL or address online to find the actual business name.

Any help on this matter is appreciated.

#business #names #scraping #web

NobleSavage

12 years ago

Generally you just use a regex library. If you are using PHP, look into cURL and regex.

There are some scraping plugins for Chrome that might help you get a better idea of how to go about it.

What language are you using? I may be able to give you more tips.


kenmichaels

12 years ago

Originally Posted by NobleSavage

Regex is fairly resource-intensive. It's probably easier for a newb to just parse tags.

OP, here is a rather simple regex pattern:

Code:

 .Pattern = "(http|ftp|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?"

If you don't understand it, then I suggest not bothering with regex.
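For anyone who wants to try a pattern like this in Python, here is a minimal sketch. It is a hedged adaptation, not the poster's own code: the groups are made non-capturing so `findall` returns whole URLs, the hyphens inside the character classes are escaped so they are read literally, and `extract_urls` and the sample text are made up for illustration.

```python
import re

# Python adaptation of the URL pattern quoted above (an assumption, not the
# poster's code). Groups are non-capturing so findall() returns full matches.
URL_PATTERN = re.compile(
    r"(?:http|ftp|https)://[\w\-]+(?:\.[\w\-]+)+"
    r"(?:[\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?"
)

def extract_urls(text):
    """Return every URL-like substring found in text, in order."""
    return URL_PATTERN.findall(text)

# Hypothetical sample text; trailing punctuation is left out of the matches.
sample = "Visit http://example.com/about, or ftp://files.example.org/pub."
print(extract_urls(sample))
# ['http://example.com/about', 'ftp://files.example.org/pub']
```

This only finds URLs inside a blob of text. It does nothing to tell you which business a URL belongs to.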


NobleSavage

12 years ago

Originally Posted by kenmichaels

Regex is fairly resource-intensive. It's probably easier for a newb to just parse tags.

OP, here is a rather simple regex pattern:

Code:

 .Pattern = "(http|ftp|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?"

If you don't understand it, then I suggest not bothering with regex.

Regex isn't so bad with a cheat sheet. Anyhow, you are correct, it would probably be easier to parse the DOM. You need to find a good DOM parser though, one that copes with malformed HTML. PHP Simple HTML DOM Parser always chokes on me with anything complex.
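To make the "parse tags instead of regex" idea concrete, here is a rough Python sketch using the standard library's `html.parser`, which keeps going when tags are left unclosed. The class name `LinkCollector`, the helper `extract_links`, and the messy sample markup are all invented for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, even in sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

# Hypothetical malformed page: unclosed <p>, <div> and <a>, plus a stray </b>.
messy = ('<div><p>Unclosed paragraph <a href="http://example.com/contact">'
         'Contact<div><a href="/about">About</b>')
print(extract_links(messy))
# ['http://example.com/contact', '/about']
```

This is an event-style parser rather than a full DOM tree, so it suits simple jobs like pulling out links or a page title.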

To the OP: if you don't want to learn programming, just hire someone to do it. You can get web scraping done cheaply.


hassansin 12 years ago

With PHP, I use DOMDocument, XPath and cURL. I have never faced any issues with parsing the DOM so far. All of these are built into PHP, so I have never had to use other libraries.

If you are into Node.js, I suggest learning the request and cheerio modules.
- chi124 12 years ago
  
  Thanks for all the replies. I was looking into Python or VBA if possible. But it looks like it will be damn near impossible to do in Python, as the HTML structure will be different for each site.
  
  
  PhilHardaker 12 years ago
  
  Scrapy is a popular Python scraping framework (scrapy.org), more modern than mechanize.
  
  But yes, every URL will need different scraping logic unless the page layouts are identical. Also consider that once you have your scraper working, there is no guarantee the website won't change its layout tomorrow. (A tip: if the site has a mobile layout, that will probably be easier to scrape.)
  
  I agree with NobleSavage though, I would outsource it. That is a common low-cost job on oDesk.
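Since layouts differ from site to site, one partial workaround is to look only at page metadata that tends to sit in the same place everywhere. The sketch below is a heuristic under that assumption, not a reliable method: it prefers an `og:site_name` or `application-name` meta tag and falls back to the `<title>` text. `NameHints` and `guess_business_name` are hypothetical names, and plenty of real pages carry none of these hints or fill them with something other than the business name.

```python
from html.parser import HTMLParser

class NameHints(HTMLParser):
    """Gather layout-independent hints about who runs a page."""

    def __init__(self):
        super().__init__()
        self.hints = {}
        self._in_title = False
        self._title_chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key in ("og:site_name", "application-name") and content:
                self.hints.setdefault(key, content.strip())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            # Collapse runs of whitespace and newlines into single spaces.
            text = " ".join("".join(self._title_chunks).split())
            if text:
                self.hints.setdefault("title", text)

    def handle_data(self, data):
        if self._in_title:
            self._title_chunks.append(data)

def guess_business_name(html):
    """Return the best available name hint, or None if the page has none."""
    parser = NameHints()
    parser.feed(html)
    parser.close()
    for key in ("og:site_name", "application-name", "title"):
        if key in parser.hints:
            return parser.hints[key]
    return None

# Hypothetical page fragment: the meta tag wins over the title.
page = ('<head><meta property="og:site_name" content="Acme Plumbing">'
        '<title>Home | Acme</title></head>')
print(guess_business_name(page))
```

Titles often look like "Home | Acme", so any result that came from the title fallback would still need manual checking.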
  

KirkMcD 12 years ago

Originally Posted by chi124

to help me find out the actual business name.

Unless the pages are identical, that's not gonna happen.
- NobleSavage 12 years ago
  
  Originally Posted by KirkMcD
  
  Unless the pages are identical, that's not gonna happen.
  
  That's why you scrape YellowPages.com
  
KirkMcD 12 years ago

That's great, but he said he has a list of URLs that he wants to check.
- NobleSavage 12 years ago
  
  Originally Posted by KirkMcD
  
  That's great but he said he has a list of urls that he wants to check.
  
  Scraping the entire YellowPages database and then comparing it against the list of URLs is the best way I can think of to do what the OP wants.
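The comparison step suggested above could look roughly like this in Python. It assumes you already have (business name, website) pairs from some directory; how you obtain them is outside this sketch. The function names and sample rows are hypothetical. URLs are matched on a normalized hostname, so `www.` prefixes, letter case, and paths do not get in the way.

```python
from urllib.parse import urlsplit

def normalize_host(url):
    """Reduce a URL to a lowercase hostname without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # assumption: bare domains like 'example.com/page'
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def match_names(urls, directory):
    """Map each input URL to a business name from directory, or None."""
    by_host = {}
    for name, site in directory:
        host = normalize_host(site)
        if host:
            by_host.setdefault(host, name)  # keep the first name seen per host
    return {url: by_host.get(normalize_host(url)) for url in urls}

# Hypothetical rows standing in for (name, website) pairs from a directory.
directory = [
    ("Acme Plumbing", "http://www.acmeplumbing.example/"),
    ("Bright Dental", "https://brightdental.example"),
]
urls = [
    "https://acmeplumbing.example/contact",
    "WWW.BrightDental.example",
    "http://unknown.example",
]
for url, name in match_names(urls, directory).items():
    print(url, "->", name)
```

Matching on hostname alone will miss businesses that share a domain (for example, pages hosted on a common platform), so unmatched or ambiguous URLs still need a second look.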
  