How to track crawlers using PHP

8 replies

Can anyone tell me how to track web/search crawlers using PHP?
#crawlers #php #programming #script #seo #track #web
  • unnatural
    There is no reliable way: user-agents can be faked, and crawlers don't necessarily request robots.txt or follow the rules outlined in it.

    That said, you can look for multiple requests within a short time period (automatically or manually) and raise flags when certain thresholds are reached.
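The threshold idea above could be sketched roughly like this. It is only an illustration: the function name `recordHit`, the 60-second window and the 30-request limit are assumptions made up for the example, and a real script would need to keep `$hits` in shared storage (a file, a database, APCu) rather than in a local variable.

```php
<?php
// Hypothetical helper: records one request for $ip and reports whether that
// client has now made more than $maxHits requests in the last $windowSeconds.
// $hits maps IP => list of request timestamps and is updated in place.
function recordHit(array &$hits, string $ip, int $now, int $windowSeconds = 60, int $maxHits = 30): bool
{
    // Keep only this client's timestamps that are still inside the window.
    $recent = array_filter(
        $hits[$ip] ?? [],
        fn (int $t): bool => $t > $now - $windowSeconds
    );
    $recent[] = $now;
    $hits[$ip] = array_values($recent);

    // True means the threshold was crossed and the client should be flagged.
    return count($hits[$ip]) > $maxHits;
}

// Usage: the fallback address only exists so the sketch also runs from the CLI.
$hits = [];
$flagged = recordHit($hits, $_SERVER['REMOTE_ADDR'] ?? '127.0.0.1', time());
```

A flag raised this way only says "this client is busy". It does not prove the client is a crawler, so it is better used to trigger a manual look than an automatic block.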
    • James Kenton
      Whenever one computer requests a document or file from your website it provides your server with some details about itself. One such detail is the 'user-agent'. This is meant to be the type of browser that's making the request. It's using this information that we can work out whether it's Internet Explorer, Firefox or some other browser that's being used and tweak what is sent to make sure things look right.

      Many search engines tell us who they are. Google seem to identify themselves. They use the 'user-agent' name "Googlebot".

      In theory you could check this field on every request and track the ones that admit to being a known crawler. However, as unnatural rightly says, you can't rely on this. The field can be easily faked. There is no reason why an agent shouldn't lie and claim to be something else.
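As a minimal sketch of checking that field on every request: the signature list and the function name `claimedCrawler` below are assumptions chosen for illustration, and a match only tells you what the client claims to be, not what it is.

```php
<?php
// Hypothetical, incomplete list of substrings that well-known crawlers put in
// their user-agent string. Extend it to suit your own traffic.
const CRAWLER_SIGNATURES = ['Googlebot', 'bingbot', 'Slurp', 'DuckDuckBot', 'Baiduspider', 'YandexBot'];

// Returns the first matching signature, or null if the user-agent does not
// admit to being one of the listed crawlers.
function claimedCrawler(string $userAgent): ?string
{
    foreach (CRAWLER_SIGNATURES as $signature) {
        if (stripos($userAgent, $signature) !== false) {
            return $signature;
        }
    }
    return null;
}

// The header may be missing entirely, so fall back to an empty string.
$crawler = claimedCrawler($_SERVER['HTTP_USER_AGENT'] ?? '');
if ($crawler !== null) {
    // Record the visit; error_log() is used here only to keep the sketch short.
    error_log(sprintf(
        '%s crawler=%s ip=%s uri=%s',
        date('c'),
        $crawler,
        $_SERVER['REMOTE_ADDR'] ?? '-',
        $_SERVER['REQUEST_URI'] ?? '-'
    ));
}
```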

      An additional field that's supplied is the IP address of the computer making the request. If we knew the IP addresses of the crawlers, we could use that. But we can't be sure that the IP address used will always be the same for each crawler.

      So, in short, you could track some of the crawlers but not all.

      May I ask why you want to use PHP when you could use one of the popular access-log tools to see the activity of many crawlers? Many of these tools are updated regularly and will identify more crawlers because they pool the knowledge and experience of their developers.
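For comparison, here is a rough PHP sketch of the simplest thing such a log tool does: tally requests per claimed crawler. It assumes the server writes the common "combined" log format, where the user-agent is the last quoted field on each line. The function name `tallyCrawlers` and the log path in the comment are hypothetical.

```php
<?php
// Counts log lines per crawler signature. $lines can be any iterable of
// strings, for example an array or an SplFileObject opened on the log file.
function tallyCrawlers(iterable $lines, array $signatures): array
{
    $counts = [];
    foreach ($lines as $line) {
        // In the combined format the user-agent is the last quoted field.
        if (!preg_match('/"([^"]*)"\s*$/', $line, $m)) {
            continue;
        }
        foreach ($signatures as $signature) {
            if (stripos($m[1], $signature) !== false) {
                $counts[$signature] = ($counts[$signature] ?? 0) + 1;
                break; // count each line at most once
            }
        }
    }
    arsort($counts); // busiest crawler first
    return $counts;
}

// Example call (hypothetical path, adjust to your server):
// print_r(tallyCrawlers(new SplFileObject('/var/log/apache2/access.log'), ['Googlebot', 'bingbot']));
```

Like the dedicated tools, this only sees crawlers that identify themselves in the user-agent.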

      If you explain why you want to use PHP, we might be able to help you more.
  • creativerobert
    @James, thanks for such a wonderful description regarding PHP and Google bots...
  • creativerobert
    But one thing I have to say: there is no specific time at which the bots crawl, so we can't track that. Can we?
    • James Kenton
      I'm afraid not. There really is no reliable way to track a bot that doesn't want to be tracked. It's just too easy to disguise a bot. The people programming the bots are bright. Any system you could come up with to track a bot that doesn't want to be tracked can be circumvented.

      It's not good for Google's business model to be manipulated into showing sales pages to people who aren't looking for sales pages. Nor will the crawlers that are written by hackers want to be fooled when they are seeking sites with poor security.

      If there were a reliable way to identify bots, we could present a fantastic, content-rich information page to Google and a blatant sales page to the human visitor who clicked on the link in the search results.

      A search engine's reputation (profitability) is based on the quality of its results. It's how they make money. Their programmers are always refining the systems they use to index/score pages to ensure that the person performing a search gets the results they want.

      This is why the results people get when using some 'black hat' tricks for Search Engine Optimization (SEO) can be temporary (or can even reduce your ranking). You can bet that Google buy every WSO about SEO! They'll want to keep up with the techniques we might use to manipulate their ranking of our sites. Granted, they may decide that the technique revealed in last week's WSO isn't something they'll bother circumventing - but they'll definitely be keeping themselves educated. Google want their customers to get exactly what they are searching for, without any junk. They may not always achieve this, but they are always seeking ways to improve.

      In my opinion we should be very careful with Black Hat techniques. If a search engine decides that a site is trying to manipulate it to an unacceptable level, it could reduce the site's ranking or even de-index it.

      As for timing: most crawlers are indexing all of the time. You get crawled when your site reaches the top of their 'to do' list. Your place in the 'to do' list depends on so many variables that nobody could predict when you'll get indexed. You can't even be sure that some of the visitors (whether unique or regular) to your site today weren't bots in disguise!
  • quicklynx
    Capture $_SERVER['HTTP_USER_AGENT'] and $_SERVER['REMOTE_ADDR'], plus $_SERVER['HTTP_REFERER'] if you also want to know which page the request came from. This is as close as you can get to seeing who's visiting your page with PHP. As the good folks above mentioned, the headers can be faked, but you can combine several "$_SERVER" variables to paint a good picture of who's who. ...Googlebot comes from a known set of IPs, so if someone is faking Googlebot, simply match the IPs up. Hope this helps.
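One hedged way to "match the IPs up" without maintaining an IP list yourself is a reverse DNS lookup followed by a forward lookup, which is the approach Google's own webmaster guidance describes for verifying Googlebot. The sketch below assumes genuine Googlebot hosts resolve to names under googlebot.com or google.com. The function name `isVerifiedGooglebot` and the injectable resolver parameters are inventions for the example (they make the logic testable without network access).

```php
<?php
// Returns true only if $ip reverse-resolves to a Google host name AND that
// host name resolves forward to the same IP again.
function isVerifiedGooglebot(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';   // IP => host name
    $forward = $forward ?? 'gethostbynamel';  // host name => list of IPv4 addresses

    // gethostbyaddr() returns the unmodified IP when the lookup fails
    // and false when the input is malformed.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }

    // Assumption: real Googlebot hosts end in .googlebot.com or .google.com.
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }

    // The forward lookup guards against someone faking the reverse record.
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}

// Example call inside a request handler:
// $isGoogle = isVerifiedGooglebot($_SERVER['REMOTE_ADDR']);
```

DNS lookups are slow, so it would make sense to run this only for requests whose user-agent claims to be Googlebot and to cache the result per IP. Note that gethostbynamel() only returns IPv4 addresses, so IPv6 visitors would need a different forward lookup.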
  • windso0
    What I ended up doing is not pretty, but it works.
    In my download script I add the following code:
    <?php
    // Bootstrap Piwik in tracker mode so the download can be logged server-side.
    $GLOBALS['PIWIK_TRACKER_DEBUG'] = false;
    define('PIWIK_TRACKER_MODE', true);
    define('PIWIK_INCLUDE_PATH', '/var/www/html/piwik');
    // Keep running even if the client disconnects mid-request.
    @ignore_user_abort(true);

    // Let the require_once calls below resolve against the Piwik directories.
    set_include_path(PIWIK_INCLUDE_PATH
        . PATH_SEPARATOR . PIWIK_INCLUDE_PATH . '/core'
        . PATH_SEPARATOR . PIWIK_INCLUDE_PATH . '/libs/'
        . PATH_SEPARATOR . PIWIK_INCLUDE_PATH . '/plugins/'
        . PATH_SEPARATOR . get_include_path());
    require_once "Common.php";
    require_once "PluginsManager.php";
    require_once "Tracker.php";
    require_once "Tracker/Config.php";
    require_once "Tracker/Action.php";
    require_once "Cookie.php";
    require_once "Tracker/Db.php";
    require_once "Tracker/Visit.php";
    require_once "Tracker/GoalManager.php";

    // Hand the tracker its parameters via $_GET.
    // $mydownloadurl must already be set by the surrounding download script.
    $my_array = array('idsite' => '1', 'download' => $mydownloadurl, 'redirect' => '0');
    $_GET = $my_array;

    // Buffer whatever the tracker prints and discard it so it doesn't end up in the download.
    ob_start();
    $process = new Piwik_Tracker;
    $process->main();
    ob_clean();
    ?>

    I cooked this up from different readings, but the bottom line is that it is working.
    The only problem I am trying to solve now is that it refuses to load the "Goal" plugin.
    Any suggestions to make this better, or even to get the Goal plugin to load, would be most welcome.
  • DavidWincent
    You can use several PHP $_SERVER variables together to track crawlers. We cannot expect a bot to crawl at a fixed time, and it is hard to predict when a bot will crawl our website.