Simple web crawler/spider

11 replies
Hi all,

I am just developing a very simple web spider/crawler. Here is the code:

PHP Code:
<?php

// Start page for the crawl
$seed = "http://www.akosblog.com";
$html = file_get_contents($seed);
echo "Page : " . $seed;

// Collect every http:// URL that appears in the page source
preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $val) {
    echo "<br><font color=red>links :</font> " . $val[0] . "\r\n";
}
?>
This code just gets all the links from the selected page. Now I want to move on: I want the spider to follow those links, index the pages they lead to, then follow the links on those pages, and so on.
How could I do that?

Regards,
Akos
  • ussher
    Here is a link to a GPL search script that spiders your site for search terms.

    Take a look at how it does the spidering.
    Orca PHP Scripts - Camouflaged PHP/MySQL Web Applications
  • tks
    Have a look at sphider => sphider.eu/about.php
  • You need to put everything into a recursive function.
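
A minimal sketch of what the recursive approach could look like, not a finished crawler. The function name `crawlRecursive` and the `$getLinks` callable are made up for illustration; here `$getLinks` reads from a small in-memory array so the recursion can be tried without any network access. In a real spider it would download the page and extract its links.

```php
<?php
// Recursive, depth-limited crawl sketch.
// $getLinks is a hypothetical callable returning the URLs found on a page.
// $visited is passed by reference so every recursion level shares it.
function crawlRecursive(string $url, callable $getLinks, int $depth, array &$visited): void
{
    // Stop when the depth budget is used up or the page was already seen.
    if ($depth < 0 || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true;

    foreach ($getLinks($url) as $link) {
        crawlRecursive($link, $getLinks, $depth - 1, $visited);
    }
}

// Stand-in for a real site: page URL => links on that page (assumed data).
$graph = [
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/', 'http://example.com/c'],
    'http://example.com/b' => [],
    'http://example.com/c' => [],
];
$getLinks = fn(string $url) => $graph[$url] ?? [];

$visited = [];
crawlRecursive('http://example.com/', $getLinks, 1, $visited);
print_r(array_keys($visited));
```

The visited list is what keeps the recursion from looping forever when two pages link to each other. Recursion walks the site depth-first, so on a large site the depth limit matters: without it the call stack grows with every link followed.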
    • rts2271
      Storing the data in bulk is going to be more of an issue than retrieving it. Check out NoSQL solutions like MongoDB or Couchbase to store the data as a collection, then batch it into an RDBMS if needed. That will give you decent performance without a write-locked table.
  • briannn
    Dude, I suggest you use "PHP Simple HTML DOM Parser". It will make your job much easier. You can download it and read the documentation here: simplehtmldom.sourceforge.net
    • GeneralLedger
      Another option is kimonolabs.com or import.io
  • blinkenlights
    I also suggest using an existing spider codebase instead of rolling your own. Ideally the code should also be DOM-based rather than relying on regular expressions, because regexes tend to break when a website's markup changes.
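
To illustrate the DOM-based idea, here is a rough sketch using PHP's built-in `DOMDocument` class (not the Simple HTML DOM Parser library mentioned earlier in the thread). The helper name `extractLinks` and the sample HTML are assumptions made up for the example.

```php
<?php
// Hypothetical helper: pull absolute http(s) links out of an HTML string
// by walking <a> elements instead of pattern-matching the raw source.
function extractLinks(string $html): array
{
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        // Keep only absolute http/https URLs. Relative links are skipped
        // here; a fuller version would resolve them against the page URL.
        if (preg_match('#^https?://#i', $href)) {
            $links[$href] = true; // array keys de-duplicate
        }
    }
    return array_keys($links);
}

// Made-up sample page with one absolute link, one relative link, one duplicate.
$sample = '<html><body><a href="http://example.com/a">A</a> '
        . '<a href="/relative">B</a> '
        . '<a href="http://example.com/a">again</a></body></html>';
print_r(extractLinks($sample));
```

Unlike the regex in the original post, this only looks at `href` attributes of `<a>` tags, so URLs that merely appear in text, scripts or image sources are not picked up as links.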

    Regarding your actual question about spidering sub-links: what you usually do is initialize a queue to store the URLs and populate it with the seed URL. Then you create a loop that pops a URL from the queue, downloads the page, extracts the sub-links from it, and adds those sub-links to the queue. You loop until the queue is empty. You probably want to add some more sophistication, such as only adding URLs on the same domain, only following links up to a certain depth, and skipping URLs you have already queued. When adding sub-links you usually append them to the end of the queue and pop from the front, which gives you a breadth-first search.
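
The queue loop described above could be sketched roughly like this. The function name `crawl` and the `$fetch` and `$extract` callables are hypothetical; they are passed in so the loop can be tried against a fake in-memory site instead of the network. `SplQueue` is PHP's built-in queue class.

```php
<?php
// Breadth-first crawl sketch.
// $fetch(url)    -> page content, or false on failure (hypothetical callable)
// $extract(page) -> array of absolute URLs found on the page (hypothetical callable)
function crawl(string $seed, callable $fetch, callable $extract, int $maxDepth = 2): array
{
    $host = parse_url($seed, PHP_URL_HOST);

    $queue = new SplQueue();
    $queue->enqueue([$seed, 0]);   // each entry is [url, depth]
    $seen = [$seed => true];       // URLs already queued, to avoid repeats
    $visited = [];

    while (!$queue->isEmpty()) {
        [$url, $depth] = $queue->dequeue();   // pop from the front

        $page = $fetch($url);
        if ($page === false) {
            continue;                          // download failed, skip it
        }
        $visited[] = $url;

        if ($depth >= $maxDepth) {
            continue;                          // do not expand pages at the depth limit
        }
        foreach ($extract($page) as $link) {
            // Same-domain only, and never queue the same URL twice.
            if (isset($seen[$link]) || parse_url($link, PHP_URL_HOST) !== $host) {
                continue;
            }
            $seen[$link] = true;
            $queue->enqueue([$link, $depth + 1]);   // add to the end
        }
    }
    return $visited;
}

// Fake site for a dry run: URL => links on that page (assumed data).
$site = [
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b', 'http://other.test/x'],
    'http://example.com/a' => ['http://example.com/c', 'http://example.com/'],
    'http://example.com/b' => [],
    'http://example.com/c' => ['http://example.com/d'],
    'http://example.com/d' => [],
];
// In this dry run the "page content" is simply the URL itself.
$fetch = fn(string $url) => array_key_exists($url, $site) ? $url : false;
$extract = fn(string $page) => $site[$page];

print_r(crawl('http://example.com/', $fetch, $extract, 2));
```

For a real crawl, `$fetch` could wrap `file_get_contents()` (which returns `false` on failure) and `$extract` could be whatever link extraction you already have. A production spider would also need politeness delays, robots.txt handling and relative-URL resolution, none of which is covered here.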
  • mojojuju
    I think I'd look into using something like Nutch, unless you're doing this purely for educational purposes.
  • yogyogi
    Are you making a search engine?
    You will have to study some mathematics and algorithms to arrive at a good solution here. In fact, Google also hires mathematicians to develop faster and more economical search algorithms.
    Study two things:
    1. Algorithms - this is a subject taught to engineering students; if you have a friend in engineering, consult him/her.
    2. Data structures - this covers how data is organized and how to traverse different structures. The internet is itself a structure of linked pages.

    Thanks.
    • Zach Zhang
      Python + BeautifulSoup (BS) will be much easier than PHP.
  • Microsys
    Generally, you will have to decide how much you want to code yourself. If this is a one-off solution, it is probably best to use well-developed solutions created by others. However, if it is a product you want to sell, you may want to write a larger part yourself so you are not dependent on any third party. (Of course, depending on whether your project is commercial or not, you may have a fairly large number of open source projects to pick from.)

    Also worth noting: solutions that work well on 1,000-page websites can fail on one-million-page websites, and large sites bring a whole slew of other potential problems and issues.