Re: Warning: robots.txt unreliable in Apache servers



Anonymous, quoting Philip Ronan wrote:

> I recently discovered that robots.txt files aren't necessarily any use on
> Apache servers.
>
> For some reason, the Apache developers decided to treat multiple
> consecutive forward slashes in a request URI as a single forward slash. So
> for example, <http://apache.org/foundation/> and
> <http://apache.org//////foundation/> both resolve to the same page.

I could not find anything about the semantics of empty path segments in HTTP
URLs, but this behaviour seems to be common practice. What about IIS or
other web servers?

> Let's suppose the Apache website owners want to stop search engine robots
> crawling through their "foundation" pages. They could put this rule in
> their robots.txt file:
>
> Disallow: /foundation/
>
> But if I posted a link to //////foundation/ somewhere, the search engines
> will be quite happy to index it because it isn't covered by this rule.
>
> As a result of all this, Google is currently indexing a page on my website
> that I specifically asked it to stay away from :-(

I would tend to blame Googlebot (and any other affected robot). Unless a
different behaviour ('...foo//bar...' and '...foo/bar...' resolving to
different resources on the server) is common practice, the robot should
normalize such paths (removing empty segments) before matching them against
the rules from the robots.txt file.
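
To illustrate what I mean, here is a rough sketch in Python of how a robot
might collapse empty path segments before checking a URL against Disallow
rules. The function names normalize_path and is_disallowed are made up for
this example, and the matching is a plain prefix test; I am not claiming
that any real crawler works this way.

```python
import re
from typing import Iterable
from urllib.parse import urlsplit


def normalize_path(url: str) -> str:
    """Return the URL's path with runs of '/' collapsed to a single '/'."""
    path = urlsplit(url).path or "/"
    return re.sub(r"/{2,}", "/", path)


def is_disallowed(url: str, disallow_rules: Iterable[str]) -> bool:
    """Prefix-match the normalized path against Disallow rules.

    Assumption: rules are simple path prefixes, and an empty rule
    disallows nothing.
    """
    path = normalize_path(url)
    return any(rule and path.startswith(rule) for rule in disallow_rules)


rules = ["/foundation/"]
print(is_disallowed("http://apache.org//////foundation/", rules))
print(is_disallowed("http://apache.org/docs/", rules))
```

With the rule from the quoted example, the first URL is treated as
disallowed once the extra slashes are collapsed, and the second is not. If a
server really did serve different resources for '//' and '/', this
normalization would be wrong for that server.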

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/