These bots are distributed bots that crawl web sites and search the URLs listed on each site. The main purpose of a bot is to help index sites and URLs. Bots are used by every search engine to index the URLs they visit, in the hope of categorizing content for search results.
You can check this website for an up-to-date list of known web crawlers;
Now, web crawlers are supposed to honor the robots.txt file if present. This is an honor-system relationship built on trust. Take for example the http://www.hp.com main site;
#$Header: robots.txt,v 1.19 2009/10/19 16:47:17 autreja Exp $ $Locker: $
# robots.txt file for www.hp.com
# send e-mail to hp<dot>comOperations<at>hp<dot>com for updates or problems
The entries in the file give web crawlers instructions on what they are allowed or not allowed to crawl.
The "User-agent: *" line means the rules that follow apply to all crawlers. If you want to be specific, you can include instructions for a known crawler via its User-agent: id.
The "Disallow:" lines tell the bot what we do not want it to crawl. These are like signs that say don't enter, no parking, or no trespassing. They have no enforcement. The opposite of a "Disallow:" is an "Allow:". You can use these along with the disallow lines to provide instructions on specific content.
The "Sitemap:" line is just that: a sitemap that helps the crawler find content that might otherwise not be found or seen. You can read about sitemaps here; http://www.sitemaps.org/
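To make the User-agent, Allow, Disallow and Sitemap lines a bit more concrete, here is a minimal sketch of how a well-behaved crawler might check a URL before fetching it, using Python's standard urllib.robotparser module. The robots.txt content, the "ExampleBot" name and the www.example.com URLs are made up for illustration and are not taken from any real site. Note that this parser applies the first rule that matches, so the order of the Allow and Disallow lines matters here; other crawlers may resolve conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /private/public-report.html
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name; it falls under "User-agent: *".
for path in ("/index.html",
             "/private/secret.html",
             "/private/public-report.html"):
    allowed = parser.can_fetch("ExampleBot", "http://www.example.com" + path)
    print(path, "allowed" if allowed else "disallowed")

# Sitemap URLs listed in the file (site_maps() needs Python 3.8 or newer).
print(parser.site_maps())
```

Nothing in this check is enforced by the web server. The crawler itself decides whether to call can_fetch() and whether to respect the answer.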
Here's the robots.txt of one of my free mail hosting providers;
(output shortened)
petra:~ root# curl www.mail.com/robots.txt | more
So don't use a robots.txt file for controlling access. If you have content that needs to be secured or hidden, use proper HTTP access controls and encryption. Web crawlers may or may not follow your instructions.
Okay, so now that we have a basic understanding of web crawlers, what do they search on and what do they do with the data?
First, using Google as the example, the bot crawls over links; these links and the words within the URLs are then collected for later indexing. Certain words known as stop words are not indexed (e.g. is, the, and, etc.).
Next, an indexer takes the collected information and indexes it into the search database.
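As a rough illustration of the collect-then-index idea, here is a toy sketch of an index that skips stop words. It is not how Google or any real search engine is implemented. The page text, the www.example.com URLs and the tiny stop word list are all assumptions made up for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list; real engines use their own lists.
STOP_WORDS = {"is", "the", "and", "a", "of", "to"}


def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each non-stop word to the set of URLs where it was seen."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(url)
    return index


# Hypothetical crawled pages (URL -> collected text).
pages = {
    "http://www.example.com/": "The home of the example crawler guide",
    "http://www.example.com/robots": "Robots and the crawler honor system",
}

index = build_index(pages)
print(sorted(index["crawler"]))  # both URLs contain "crawler"
print("the" in index)            # False, stop words are never indexed
```

A search for a word then becomes a simple lookup in the index instead of a scan over every page.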
You can get an idea of what the search engine knows about your site via
Its goal is to help an end user who executes a search find content about that site when he/she searches using the Google search site.
Other ethical purposes are to help you, as a pen-tester, find information and potential targets for a customer that you have an engagement with. As in any war campaign, reconnaissance is the first step in gaining information about the enemy. The same applies here with regard to a domain.
Okay, now you should have the basics of robots.txt, the purpose of that file, and who uses it.
Here are a few bot entries you might see in your webaccess.log;
220.127.116.11 - - [19/Mar/2013:12:11:30 +0000] 192.0.2.1 www.mydomain.com "Apache" "GET /robots.txt HTTP/1.1" 80 301 242 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 27701
18.104.22.168 - - [12/Mar/2013:01:34:12 +0000] 192.0.2.1 www.mydomain.com "Apache" "GET / HTTP/1.0" 80 200 37113 0 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 56763
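If you want to pull entries like these out of a log automatically, here is a minimal sketch. It assumes the User-Agent is the last double-quoted field on each line, as in the two entries above, and it treats any User-Agent containing "bot" as a probable crawler. That is a crude heuristic of my own, not a complete detection method, and the User-Agent string can be set to anything by the client. The third sample line is a hypothetical browser request added for contrast.

```python
import re

# Matches every double-quoted field on a log line.
QUOTED_FIELD = re.compile(r'"([^"]*)"')


def user_agent(line):
    """Return the last quoted field, assumed here to be the User-Agent."""
    fields = QUOTED_FIELD.findall(line)
    return fields[-1] if fields else None


def is_probable_bot(line):
    """Crude check: does the User-Agent mention 'bot' anywhere?"""
    agent = user_agent(line)
    return agent is not None and "bot" in agent.lower()


SAMPLE_LINES = [
    ('220.127.116.11 - - [19/Mar/2013:12:11:30 +0000] 192.0.2.1 '
     'www.mydomain.com "Apache" "GET /robots.txt HTTP/1.1" 80 301 242 0 "-" '
     '"Mozilla/5.0 (compatible; Googlebot/2.1; '
     '+http://www.google.com/bot.html)" 27701'),
    ('18.104.22.168 - - [12/Mar/2013:01:34:12 +0000] 192.0.2.1 '
     'www.mydomain.com "Apache" "GET / HTTP/1.0" 80 200 37113 0 "-" '
     '"Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 56763'),
    # Hypothetical non-bot entry, invented for this example.
    ('192.0.2.50 - - [19/Mar/2013:12:15:00 +0000] 192.0.2.1 '
     'www.mydomain.com "Apache" "GET /index.html HTTP/1.1" 80 200 5120 0 "-" '
     '"Mozilla/5.0 (X11; Linux x86_64)" 1200'),
]

for line in SAMPLE_LINES:
    if is_probable_bot(line):
        print(user_agent(line))
```

If your log format puts the User-Agent somewhere else, adjust user_agent() to pick the right field.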
Happy bot monitoring :)
Freelance Network Security Engineer
kfelix @ hyperfeed d.o.t com