Simple web crawler/spider

by AkosBlog

Posted: 15 years ago 11 replies

PROGRAMMING

Hi all,

I am just developing a very simple web spider/crawler. Here is the code:

PHP Code:

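  <?php

$seed = "http://www.akosblog.com";

// file_get_contents() returns false on failure, so check before parsing.
$html = file_get_contents($seed);
if ($html === false) {
    die("Could not fetch " . $seed);
}
echo "Page : " . htmlspecialchars($seed);

// Match absolute http:// and https:// URLs up to the next quote, whitespace or angle bracket.
preg_match_all("/https?:\/\/[^\"\s'<>]+/", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $val) {
    // $val[0] holds the full match; escape it before echoing it into HTML.
    echo "<br><font color=red>links :</font> " . htmlspecialchars($val[0]) . "\r\n";
}
?>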

This code just gets all the links from the selected page. Now I want to move on: I want the spider to follow each link it finds, index that page, then follow the links on that page, and so on.
So how could I do that?

Regards,
Akos

#crawler or spider #simple #web

ussher 15 years ago

Here is a link to a GPL search script that spiders your site for search terms.

Take a look at how it's doing the spider system.
Orca PHP Scripts - Camouflaged PHP/MySQL Web Applications
Signature

"Jamroom is a Profile Centric CMS system suitable as a development framework for building entire communities. Highly modular in concept. Suitable for enterprise level development teams or solo freelancers."
- jamroom.net
Download Jamroom free: Download
tks 15 years ago

Have a look at sphider => sphider.eu/about.php
Signature
Cloud Computing
Android Project Ideas
premiumwebtechnologies 11 years ago

You need to put everything into a recursive function.
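
A minimal sketch of what the recursive approach could look like, reusing the same kind of regex as the original post. The names `extractLinks` and `crawlRecursive`, the `$fetch` callable and the depth limit are all assumptions made for illustration, not part of the original code. The `$visited` array is there so the recursion does not loop forever on pages that link back to each other.

```php
<?php

// Hypothetical helper: pull absolute http(s) URLs out of a page with a regex.
function extractLinks(string $html): array
{
    preg_match_all("/https?:\/\/[^\"\s'<>]+/", $html, $matches);
    return array_unique($matches[0]);
}

// Recursively visit $url and every link found on it, at most $maxDepth links away.
// $fetch is any callable that returns the page HTML, or false on failure.
// $visited is passed by reference so each URL is fetched only once.
function crawlRecursive(string $url, callable $fetch, int $maxDepth, array &$visited): void
{
    if ($maxDepth < 0 || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true;

    $html = $fetch($url);
    if ($html === false) {
        return;
    }

    foreach (extractLinks($html) as $link) {
        crawlRecursive($link, $fetch, $maxDepth - 1, $visited);
    }
}

// Possible usage (commented out because it needs network access):
// $visited = [];
// crawlRecursive("http://www.akosblog.com", 'file_get_contents', 2, $visited);
// print_r(array_keys($visited));
```

Without a depth limit the recursion could run very deep on a large site, which is one reason a queue-based loop is often preferred.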
Signature
http://premiumwebtechnologies.com

Affordable, Wordpress plugins & Web Applications
- rts2271 11 years ago
  
  Storing the data in bulk is going to be more of an issue than retrieving it. Check out NoSQL solutions like Mongo or Couchbase to store the data as a collection, then batch it into an RDBMS if needed. That will give you decent performance without a write-locked table.
  
  Signature
  
  Ralph Smith
  Mercenary development and deployment.
  
briannn 11 years ago

Dude, I suggest you use "PHP Simple HTML DOM Parser". It will make your job much easier. You can download it and read the documentation here: simplehtmldom.sourceforge.net
- GeneralLedger 11 years ago
  
  Another option is kimonolabs.com or import.io
  
blinkenlights 11 years ago

I also suggest you use an existing spider codebase instead of rolling your own. Ideally the code should also be DOM-based instead of using regular expressions; regex tends to be more brittle when a website's markup changes.
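
To illustrate the DOM-based idea, here is a rough sketch using PHP's bundled `DOMDocument` class rather than any particular third-party parser. The function name `extractHrefs` is made up for this example. It reads the `href` of every `<a>` element, so URLs that merely appear in page text are not picked up the way they are with the regex.

```php
<?php

// Return the href of every <a> element in $html, using PHP's bundled DOM extension.
function extractHrefs(string $html): array
{
    // loadHTML() rejects an empty string, so handle that case up front.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $hrefs[] = $href;
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($hrefs));
}
```

Note that this returns relative links (such as `/about`) exactly as written in the page, so a real crawler would still need to resolve them against the page URL before fetching them.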

Regarding your actual question about spidering sub-links: what you usually do is initialize a queue to store the URLs and populate it with the seed URL. Then you create a loop that pops a URL from the queue, downloads the page, extracts the sub-links from it, and adds those sub-links to the queue. You probably want to add some more sophistication, like only adding URLs on the same domain and only up to a certain depth, and skipping URLs you have already queued so the loop terminates. You loop over the queue until it's empty. When adding the sub-links you usually add them to the end of the queue and pop from the front, which gives you a breadth-first search.
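
The queue loop described above could be sketched roughly like this, using PHP's built-in `SplQueue`. The function name `crawlBreadthFirst`, the `$fetch` callable and the `$maxDepth` parameter are assumptions made for the example, and link extraction uses a simple regex only to keep the sketch short.

```php
<?php

// Breadth-first crawl starting at $seed.
// $fetch: callable returning the page HTML, or false on failure (e.g. 'file_get_contents').
// Only URLs on the seed's host are followed, at most $maxDepth links away from the seed.
// Returns the URLs that were fetched successfully, in visiting order.
function crawlBreadthFirst(string $seed, callable $fetch, int $maxDepth): array
{
    $host  = parse_url($seed, PHP_URL_HOST);
    $queue = new SplQueue();
    $queue->enqueue([$seed, 0]);
    $seen  = [$seed => true];   // URLs already queued, so nothing is queued twice
    $order = [];

    while (!$queue->isEmpty()) {
        [$url, $depth] = $queue->dequeue();   // pop from the front

        $html = $fetch($url);
        if ($html === false) {
            continue;
        }
        $order[] = $url;   // this is where a real crawler would index the page

        if ($depth >= $maxDepth) {
            continue;      // deep enough: do not follow this page's links
        }

        preg_match_all("/https?:\/\/[^\"\s'<>]+/", $html, $matches);
        foreach ($matches[0] as $link) {
            if (isset($seen[$link]) || parse_url($link, PHP_URL_HOST) !== $host) {
                continue;  // already queued, or on a different host
            }
            $seen[$link] = true;
            $queue->enqueue([$link, $depth + 1]);   // push to the back
        }
    }

    return $order;
}

// Possible usage (commented out because it needs network access):
// print_r(crawlBreadthFirst("http://www.akosblog.com", 'file_get_contents', 2));
```

Swapping `dequeue()` for `pop()` would take URLs from the back of the queue instead, turning the same loop into a depth-first crawl.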
mojojuju 11 years ago

I think I'd look into using something like Nutch unless you're doing this purely for educational purposes.
Signature

:)
yogyogi 11 years ago

Are you making a search engine?
You will have to study some mathematics and algorithms to arrive at a good solution here. In fact, Google also hires mathematicians to develop faster and more economical search algorithms.
Study two things:
1. Mathematical algorithms - this is a subject taught to engineering students; if you have a friend in engineering, consult him/her.
2. Data structures - this covers how data is organized and how to traverse different structures. The Internet itself is a structure of linked pages.

Thanks.
Signature
WordPress, jQuery, HTML tutorials for Beginners & Experts.
Professional Web Developer providing high quality Ecommerce Website Designing.
- Zach Zhang 11 years ago
  
  Python + BeautifulSoup (BS) will be much easier than PHP.
  
Microsys 11 years ago

Generally you will have to decide how much you want to code yourself. If this is for a one-off solution, it is probably best to use well-developed solutions created by others. However, if it is a product you want to sell, you may want to write a larger part yourself so you are not dependent on anything third-party. (Of course, depending on whether your project is commercial or not, you may have a fairly large number of open source projects to pick from.)

Also worth noting is that solutions that work well on 1,000-page websites can fail on one-million-page websites, and large sites bring a whole slew of other potential problems and issues.
Signature
XML-Sitemaps-Generator.com | HREFLang-Sitemaps.com | HTML-Sitemap.com | TechSEO360.com
