How do music sites crawl the whole web?

5 replies
Does anyone know how music sites like dilandau, mp3skull, and beemp3 crawl for direct mp3 links from servers all over the web?

If anyone is an expert on this stuff, kindly PM me. I'm willing to pay a sizeable amount of cash.
#crawl #music #sites #web
  • Profile picture of the author phpg
    More or less like a regular search engine. The difference is that they generally don't have the resources to recrawl the whole web at short intervals (as Google does), and they don't need to. They keep something like an internal "ranking" of sites, based on statistics such as how many mp3s (or links to mp3s) a site has, how long those links stay alive, and how often the site refreshes. They crawl the top-ranking sites more often and ignore the vast majority of sites completely. Something like that.
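A minimal sketch of that kind of internal ranking, in Python. Everything here is an assumption made for illustration: the `SiteStats` fields, the weights in `score`, the `min_score` cutoff, and the example URLs are hypothetical, not anything these sites are known to use.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    """Hypothetical per-site statistics gathered on earlier crawls."""
    url: str
    mp3_links: int                 # mp3 links found on the last visit
    avg_link_lifetime_days: float  # how long links stay alive on average
    new_links_per_week: float      # how often the site refreshes


def score(site: SiteStats) -> float:
    # Arbitrary illustrative weights; a real crawler would tune these.
    return (1.0 * site.mp3_links
            + 2.0 * site.avg_link_lifetime_days
            + 5.0 * site.new_links_per_week)


def crawl_plan(sites, top_n=2, min_score=50.0):
    """Split sites into (frequent, occasional) URL lists.

    Sites scoring below min_score are ignored completely.
    """
    ranked = sorted((s for s in sites if score(s) >= min_score),
                    key=score, reverse=True)
    frequent = [s.url for s in ranked[:top_n]]
    occasional = [s.url for s in ranked[top_n:]]
    return frequent, occasional


sites = [
    SiteStats("http://a.example", 5000, 30.0, 200.0),
    SiteStats("http://b.example", 300, 10.0, 5.0),
    SiteStats("http://c.example", 3, 1.0, 0.0),
    SiteStats("http://d.example", 1200, 60.0, 40.0),
]
frequent, occasional = crawl_plan(sites)
print("crawl often:", frequent)
print("crawl rarely:", occasional)
```

The low-scoring site never enters either list, which is the "ignore the vast majority" part; the other two lists would simply be visited on different schedules.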
    • Profile picture of the author phpg
      Originally Posted by JayWiz View Post

      Here is my opinion:
      1. Search engine
      2. Other mp3 crawler sites
      3. File sharing sites

      Those are what meta-search sites use; sites like beemp3 are more complicated.
  • Profile picture of the author Nochek
    I don't know how any other site does it, but this is how I would do it:

    Get a list of large mp3 repositories.
    Once a week, run a cron job to scrape each one for mp3 info (title, description, length, link, etc.).
    Stash it in a local database for easier retrieval.
    Depending on your server, I would just stream the music through rather than downloading and re-serving it, or just serve the user the mp3 link directly.
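The "stash it in a local database" step could look roughly like this in Python with the standard `sqlite3` module. This is only a sketch under assumptions: the `tracks` table layout, the function names, and the sample records are made up, and the scraping itself is left out (the `scraped` list stands in for whatever the weekly job collects). A weekly run would be scheduled by a crontab entry such as `0 3 * * 0 python3 scrape.py`, where `scrape.py` is a hypothetical script name.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the local database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tracks (
               link           TEXT PRIMARY KEY,
               title          TEXT,
               length_seconds INTEGER,
               source         TEXT
           )"""
    )
    return conn


def store_tracks(conn, tracks):
    # INSERT OR REPLACE keeps one row per link, so re-running the weekly
    # job refreshes existing rows instead of duplicating them.
    conn.executemany(
        "INSERT OR REPLACE INTO tracks (link, title, length_seconds, source) "
        "VALUES (:link, :title, :length_seconds, :source)",
        tracks,
    )
    conn.commit()


def search(conn, term):
    """Return (title, link) pairs whose title contains the search term."""
    rows = conn.execute(
        "SELECT title, link FROM tracks WHERE title LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return rows.fetchall()


conn = open_db()
# Placeholder records; a real job would fill this from the scraped pages.
scraped = [
    {"link": "http://repo.example/demo.mp3", "title": "Demo Song",
     "length_seconds": 215, "source": "http://repo.example"},
    {"link": "http://repo.example/other.mp3", "title": "Other Track",
     "length_seconds": 187, "source": "http://repo.example"},
]
store_tracks(conn, scraped)
store_tracks(conn, scraped)  # a second run does not create duplicates
print(search(conn, "Demo"))
```

Storing only the link matches the "serve the user the mp3 link directly" option: the site never has to hold the audio itself.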

    You could also just smash search engines and the web with a new spider, and do a regex search:

    /^([a-zA-Z0-9_-]+)(\.mp3)$/

    Then parse out the extra information on the fly by simulating a human language interpreter and storing all the values in your database. You would then have the most extensive library next to Google's (after a very long time indexing, and only if you make a good spider).
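One caveat on the pattern above: as posted it is anchored with `^` and `$` and its character class has no `/`, `:` or `.`, so it only matches a bare filename such as `song.mp3`, not a full link. A rough Python sketch of the spider's link-extraction step follows, using only the standard library. The class and function names, the example page, and the title-guessing rule are assumptions for illustration; fetching pages and any smarter "interpretation" of the surrounding text are left out.

```python
import posixpath
import re
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin, urlparse

# Match on the URL path only, so query strings like "?dl=1" do not matter.
MP3_PATH = re.compile(r"\.mp3$", re.IGNORECASE)


class Mp3LinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .mp3."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        if MP3_PATH.search(urlparse(absolute).path):
            self.links.append(absolute)


def guess_title(link):
    """Crude title guess: file name without extension, separators as spaces."""
    name = posixpath.basename(unquote(urlparse(link).path))
    stem = re.sub(r"\.mp3$", "", name, flags=re.IGNORECASE)
    return re.sub(r"[_\-]+", " ", stem).strip()


page = (
    '<a href="/music/My_Song-Live.mp3">listen</a> '
    '<a href="about.html">About</a> '
    '<a href="http://cdn.example/a%20b.MP3?dl=1">download</a>'
)
parser = Mp3LinkParser("http://site.example/index.html")
parser.feed(page)
for link in parser.links:
    print(guess_title(link), "->", link)
```

Using an HTML parser for the links and a regex only for the `.mp3` check avoids most of the false matches a regex over raw page text would produce.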
  • Profile picture of the author Randy27
    Other mp3 crawler sites