Re: Warning: robots.txt unreliable in Apache servers
- From: Benjamin Niemann <pink@xxxxxxxxxx>
- Date: Sun, 30 Oct 2005 09:57:50 +0100
Anonymous, quoting Philip Ronan wrote:
> I recently discovered that robots.txt files aren't necessarily any use on
> Apache servers.
>
> For some reason, the Apache developers decided to treat multiple
> consecutive forward slashes in a request URI as a single forward slash. So
> for example, <http://apache.org/foundation/> and
> <http://apache.org//////foundation/> both resolve to the same page.
I could not find anything about the semantics of empty path segments in HTTP
URLs, but this behaviour seems to be common practice. What about IIS or
other web servers?
> Let's suppose the Apache website owners want to stop search engine robots
> crawling through their "foundation" pages. They could put this rule in
> their robots.txt file:
>
> Disallow: /foundation/
>
> But if I posted a link to //////foundation/ somewhere, the search engines
> will be quite happy to index it because it isn't covered by this rule.
>
> As a result of all this, Google is currently indexing a page on my website
> that I specifically asked it to stay away from :-(
I would tend to blame googlebot (and any other affected robot). Unless a
different behaviour ('...foo//bar...' and '...foo/bar...' resolve to
different resources on the server) is common practice, the robot should
normalize such paths (removing empty segments) before matching them against
the rules from the robots.txt file.
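
A rough sketch of that normalization step, assuming a robot that does plain
prefix matching of Disallow rules. The function names normalize_path and
is_disallowed are made up for illustration and are not taken from any real
crawler:

```python
import re

def normalize_path(path: str) -> str:
    """Collapse runs of consecutive slashes, i.e. drop empty path segments."""
    return re.sub(r"/{2,}", "/", path)

def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Hypothetical matcher: a path is blocked if it starts with any
    non-empty Disallow prefix."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)

rules = ["/foundation/"]
raw_path = "//////foundation/"

# Without normalization the rule is bypassed.
print(is_disallowed(raw_path, rules))
# After normalization the same rule applies.
print(is_disallowed(normalize_path(raw_path), rules))
```

This only makes sense if the server really treats both forms as the same
resource, which is the assumption discussed above.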
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/