I have a website that has over 300,000 dynamic webpages. I am looking for a script that will crawl the site and create an XML sitemap file that I can submit to Google.

I found a desktop version but it has been running for 12 hours now and is only 1/5 of the way there!

John
#google #sitemap
  • TristanPerry
    How is the website set up/structured? Surely it's (at least in part) database driven, so you could just write a simple script to generate the sitemap?
  • john_kennedy
    It is almost 100% database driven (dynamic). I suppose I could do a script if all else fails.

    Looking for an off-the-shelf solution first.

    John
  • TristanPerry
    Since it's 300k pages, I would imagine a custom script would be best (the only other option is to crawl the site, as you are doing, which means 300,000 Apache requests, over a million MySQL queries, etc.), but good luck in your search. A rough sketch of that kind of script follows below.
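For anyone who wants to go the custom-script route suggested above, here is a minimal, untested sketch of a PHP script that builds the sitemap straight from the database instead of crawling the site. The connection details and the table/column names (pages, slug, updated_at) are assumptions; adjust them to your own schema.

<?php
// Hypothetical sketch: build the sitemap straight from the database
// instead of making 300,000 HTTP requests with a crawler.
// Table and column names (pages, slug, updated_at) are assumptions.
$db = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');

$out = fopen('sitemap.xml', 'w');
fwrite($out, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
fwrite($out, "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");

$stmt = $db->query('SELECT slug, updated_at FROM pages');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $loc     = 'http://www.example.com/' . htmlspecialchars($row['slug']);
    $lastmod = date('Y-m-d', strtotime($row['updated_at']));
    fwrite($out, "  <url><loc>$loc</loc><lastmod>$lastmod</lastmod></url>\n");
}

fwrite($out, "</urlset>\n");
fclose($out);

One caveat: the sitemap protocol caps a single sitemap file at 50,000 URLs, so a 300,000-page site has to be split across several files joined by a sitemap index, which is exactly what Bruce describes further down.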
  • awesometbn
    I know there are online services that will automatically crawl your site and create sitemaps for you. I checked xml-sitemaps.com, but it looks like the max is 500 pages.

    What offline software are you using? I found the freeware at gsitecrawler.com. There are probably others. Since a sitemap is just a standard XML format, you could probably script this yourself and generate it quickly with Perl.
    • Bruce Hearder
      I have quite a few websites with 100K+ pages, and have found that BigG does not like XML sitemaps that are really big.

      There seems to be a bit of an inverse correlation between how big your sitemap file is (i.e. number of entries) and how quickly Google will spider & index your site.

      What I have found is that it's best to break your sitemap into a bunch of smaller sitemaps, each one 500-1,000 entries in size, and then link all of these smaller sitemaps together with a sitemap index.

      It works something like the following:
      <?xml version="1.0" encoding="UTF-8"?>
      <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
      <loc>http://www.example.com/sitemap.php?page=1</loc>
      <lastmod>2009-10-11</lastmod>
      </sitemap>
      <sitemap>
      <loc>http://www.example.com/sitemap.php?page=2</loc>
      <lastmod>2009-10-11</lastmod>
      </sitemap>
      <sitemap>
      <loc>http://www.example.com/sitemap.php?page=3</loc>
      <lastmod>2009-10-11</lastmod>
      </sitemap>
      </sitemapindex>

      This setup works great.

      You can then do some clever stuff like dynamic sitemaps and other tricks, and BigG will absolutely love your site, indexing thousands of pages in a flash. A sketch of a dynamic sitemap.php along these lines follows after this reply.

      Oh, before I forget: to get the search engines to re-visit your sitemap, you must ping them directly.

      I use these (found them on Wikipedia) and they seem to work gangbusters; a tiny ping snippet also follows after this reply:

      Google
      http://www.google.com/webmasters/sit...=URL_OF_SITEMAP

      Yahoo!
      http://search.yahooapis.com/SiteExpl...URL_OF_SITEMAP

      Ask.com
      http://submissions.ask.com/ping?sitemap=URL_OF_SITEMAP

      Bing/Live (whatever you wanna call it)
      http://webmaster.live.com/ping.aspx?...URL_OF_SITEMAP

      I hope this helps

      Bruce
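As an illustration of the dynamic-sitemap idea Bruce mentions, here is a rough, untested sketch of a sitemap.php along those lines, again assuming a hypothetical pages table with id, slug and updated_at columns. Called with no parameter it emits the sitemap index; called with ?page=N it emits one 1,000-entry chunk.

<?php
// Hypothetical sketch of a dynamic sitemap.php:
//   sitemap.php          -> sitemap index (one entry per 1,000-row chunk)
//   sitemap.php?page=N   -> the N-th chunk of URLs
// Table and column names (pages, id, slug, updated_at) are assumptions.
$db      = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
$perPage = 1000;

header('Content-Type: application/xml; charset=UTF-8');
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";

$page = isset($_GET['page']) ? (int) $_GET['page'] : 0;

if ($page < 1) {
    // Sitemap index: list every chunk.
    $total  = (int) $db->query('SELECT COUNT(*) FROM pages')->fetchColumn();
    $chunks = (int) ceil($total / $perPage);
    echo "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
    for ($i = 1; $i <= $chunks; $i++) {
        echo "  <sitemap><loc>http://www.example.com/sitemap.php?page=$i</loc></sitemap>\n";
    }
    echo "</sitemapindex>\n";
} else {
    // One chunk of 1,000 URLs.
    $offset = ($page - 1) * $perPage;
    $stmt = $db->prepare('SELECT slug, updated_at FROM pages ORDER BY id LIMIT :lim OFFSET :off');
    $stmt->bindValue(':lim', $perPage, PDO::PARAM_INT);
    $stmt->bindValue(':off', $offset, PDO::PARAM_INT);
    $stmt->execute();
    echo "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
    while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
        $loc     = 'http://www.example.com/' . htmlspecialchars($row['slug']);
        $lastmod = date('Y-m-d', strtotime($row['updated_at']));
        echo "  <url><loc>$loc</loc><lastmod>$lastmod</lastmod></url>\n";
    }
    echo "</urlset>\n";
}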
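The ping step itself is just an HTTP GET to each engine's ping endpoint with your sitemap (or sitemap index) URL appended. A minimal sketch, using only the Ask.com endpoint that appears in full above and leaving the truncated ones as placeholders to fill in:

<?php
// Ping the search engines so they re-fetch the sitemap index.
// Only Ask.com's endpoint is shown in full in the thread; fill in the
// others from the (truncated) URLs Bruce listed above.
$sitemap = urlencode('http://www.example.com/sitemap.php');

$endpoints = array(
    'http://submissions.ask.com/ping?sitemap=' . $sitemap,
    // Google, Yahoo! and Bing/Live endpoints go here once you have
    // the full URLs.
);

foreach ($endpoints as $url) {
    // A plain GET is all the ping protocol requires.
    $response = @file_get_contents($url);
    echo $url . ' => ' . ($response === false ? 'failed' : 'ok') . "\n";
}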
