Sunday, March 31, 2013

webcrawlers & robots.txt: how they work and what we use them for

In this post we will look at the purpose of webcrawlers (aka web search bots, spiders, ants, robots) and the purpose of the robots.txt file.

These bots are distributed programs that crawl web sites and follow the URLs listed on each site. The main purpose of a bot is to help index sites and URLs. Bots are used by every search engine to index the URLs they visit, in the hope of categorizing content for search.

You can check this website for an up-to-date list of known webcrawlers;

Now, web crawlers are supposed to honor the robots.txt file if one is present. This is a trust-based honor relationship. Take, for example, HP's main site;

#$Header: robots.txt,v 1.19 2009/10/19 16:47:17 autreja Exp $ $Locker:  $

# robots.txt file for
# send e-mail to hp<dot>comOperations<at>hp<dot>com for updates or problems

User-agent:    *
Disallow:    /cgi-bin/
Disallow:    /info/
Disallow:    /support/
Disallow:    /JumpData/
Disallow:    /cposupport/
Disallow:       /whpadmin/
Disallow:       /offweb/
Disallow:       /hho/res/us/en/


These lines give web crawlers instructions on what is allowed or denied.

The "User-agent: *" applies the rules to all crawlers. If you wanted to be specific, you could include instructions for a known crawler under its own User-agent: identifier.

The above "Disallow:" lines tell the bot what we do not want it to crawl. These are like signs that say: do not enter, no parking, no trespassing. They have no enforcement. The opposite of a "Disallow:" is an "Allow:". You can use these together with disallows to give instructions about specific content.


Disallow:  /home/
Allow: /home/kentheethicalhacker/
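You can check how combined Allow/Disallow rules play out programmatically. Here's a minimal sketch using Python's standard urllib.robotparser, feeding it the example rules above directly (note that this parser applies rules in file order, first match wins, so the more specific Allow is listed first):

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring the Allow/Disallow example above
# (Allow listed first, since robotparser uses first-match ordering)
rules = """\
User-agent: *
Allow: /home/kentheethicalhacker/
Disallow: /home/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved bot consults these rules before fetching a URL
print(rp.can_fetch("*", "/home/private/"))              # -> False (blocked)
print(rp.can_fetch("*", "/home/kentheethicalhacker/"))  # -> True  (allowed)
```

Remember this only tells you what a polite bot would do; nothing stops an impolite one from fetching the disallowed paths anyway.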

The "Sitemap:" directive is just that: a sitemap that helps the crawler find content that might otherwise not be found or seen. You can read about sitemaps here;
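A Sitemap line simply points the crawler at an XML list of your URLs. A minimal robots.txt entry would look like this (the example.com URL is just a placeholder):

```
User-agent: *
Disallow: /cgi-bin/
Sitemap: https://www.example.com/sitemap.xml
```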

Here's the robots.txt from one of my free mail hosting providers (output shortened);

petra:~ root# curl | more

User-agent: *
Disallow: /*?ls=*
Disallow: /*?localePreference=*
Disallow: /*;jsessionid=*
Disallow: /*;kid=*
Disallow: */company/
Disallow: /Site/

User-agent: grub-client
Disallow: /

User-agent: grub
Disallow: /

User-agent: looksmart
Disallow: /

User-agent: WebZip
Disallow: /

So don't use a robots.txt file for controlling access. If you have content that needs to be secured or hidden, use proper HTTP access controls and encryption. The web crawlers may or may not follow your instructions.
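For example, a minimal Apache .htaccess that actually enforces access, unlike robots.txt (the realm name and htpasswd path here are illustrative):

```
# .htaccess -- real access control, enforced by the server
AuthType Basic
AuthName "Restricted Area"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

A crawler that ignores robots.txt still gets a 401 from this; that's the difference between a sign and a lock.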

Okay, so now that we have a basic understanding of web crawlers, what do they search for and what do they do with the data?

1st, using Google as the example, the bot crawls over links; these links and the words within the pages are collected for later indexing. Certain words known as stop words are not indexed (e.g. "is", "the", "and", etc.).

Next, an indexer takes the collected information and indexes it into the search database.
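The two steps above, collect the words, skip the stop words, then index, can be sketched in a few lines of Python. The pages and the stop-word list here are made up purely for illustration:

```python
# Toy indexer: build an inverted index (word -> set of URLs),
# skipping stop words the way a real search engine would.
STOP_WORDS = {"is", "the", "and", "a", "of"}

pages = {  # URL -> page text (hypothetical crawled pages)
    "/home/": "the home page of ken",
    "/blog/": "ken is blogging about robots and crawlers",
}

index = {}
for url, text in pages.items():
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue  # stop words are not indexed
        index.setdefault(word, set()).add(url)

print(sorted(index["ken"]))   # -> ['/blog/', '/home/']
print("the" in index)         # -> False, stop word was skipped
```

A real indexer also stores word positions, rankings, and so on, but the inverted-index idea is the same.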

You can get an idea of what the search engine knows about your site via



Its goal is to help an end user who executes a search find content about that site when he/she searches using the Google search site.

Other ethical purposes are to help you, as a pen-tester, find information and potential targets for a customer you have an engagement with. As in any war campaign, reconnaissance is the 1st step in gaining information about the adversary. The same applies here with regard to a domain.

Okay, now you should have the basics on robots.txt, the purpose of that file, and who uses it.

Here are a few bot entries you might see in your web access log;

 - - [19/Mar/2013:12:11:30 +0000] "Apache" "GET /robots.txt HTTP/1.1" 80 301 242 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +" 27701

 - - [12/Mar/2013:01:34:12 +0000] "Apache" "GET / HTTP/1.0" 80 200 37113 0 "-" "Pingdom.com_bot_version_1.4_(" 56763
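You can spot these bots by pulling the user-agent field out of each log line. A rough sketch, assuming a combined-log-style quoted user agent; the sample line is simplified from the entries above, and the bot-name list is just a starting point:

```python
import re

# Simplified sample log line (combined-log style, user agent in the last quotes)
log_line = ('example.com - - [19/Mar/2013:12:11:30 +0000] '
            '"GET /robots.txt HTTP/1.1" 301 242 "-" '
            '"Mozilla/5.0 (compatible; Googlebot/2.1)"')

# The user agent is the last double-quoted field on the line
user_agent = re.findall(r'"([^"]*)"', log_line)[-1]

# Match against a short list of known bot names
is_bot = any(name in user_agent for name in ("Googlebot", "bingbot", "Pingdom"))
print(user_agent, is_bot)  # -> Mozilla/5.0 (compatible; Googlebot/2.1) True
```

Keep in mind the User-Agent string is self-reported, so treat a match as a hint, not proof, and verify suspicious crawlers by their source addresses.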

 Happy bot monitoring :)

Ken Felix
Freelance Network Security Engineer
kfelix  @  hyperfeed  d.o.t com
