Re: Warning: robots.txt unreliable in Apache servers
- From: David Ross <nobody@xxxxxxxxxxx>
- Date: Sun, 30 Oct 2005 09:58:07 -0800
Anonymous, quoting Philip Ronan, wrote:
>
>
> Subject: Warning: robots.txt unreliable in Apache servers
> From: Philip Ronan <invalid@xxxxxxxxxxxxxxx>
> Newsgroups: alt.internet.search-engines
> Message-ID: <BF89BF33.39FDF%invalid@xxxxxxxxxxxxxxx>
> Date: Sat, 29 Oct 2005 23:07:46 GMT
>
> Hi,
>
> I recently discovered that robots.txt files aren't necessarily any use on
> Apache servers.
>
> For some reason, the Apache developers decided to treat multiple consecutive
> forward slashes in a request URI as a single forward slash. So for example,
> <http://apache.org/foundation/> and <http://apache.org//////foundation/>
> both resolve to the same page.
>
> Let's suppose the Apache website owners want to stop search engine robots
> crawling through their "foundation" pages. They could put this rule in their
> robots.txt file:
>
> Disallow: /foundation/
>
> But if I posted a link to //////foundation/ somewhere, the search engines
> would be quite happy to index it because it isn't covered by this rule.
>
> As a result of all this, Google is currently indexing a page on my website
> that I specifically asked it to stay away from :-(
>
> You might want to check the behaviour of your servers to see if you're
> vulnerable to the same sort of problem.
>
> If anyone's interested, I've put together a .htaccess rule and a PHP script
> that seem to sort things out.
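
To illustrate the mismatch described in the quoted message, here is a
minimal sketch of the kind of prefix test a robot might apply to Disallow
rules. The function name is_disallowed is hypothetical, and the check is
deliberately simplified (no User-agent groups, no Allow lines); it is not
taken from Google's crawler or from any real parser.

```python
# Simplified robots.txt check: a path is blocked when it starts with one of
# the Disallow prefixes. Real parsers also handle User-agent groups, Allow
# lines and more; this sketch only shows the prefix comparison.
def is_disallowed(path, disallow_rules):
    # An empty Disallow value blocks nothing, so empty rules are skipped.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


rules = ["/foundation/"]  # the rule from the quoted message

print(is_disallowed("/foundation/", rules))       # True: covered by the rule
print(is_disallowed("//////foundation/", rules))  # False: prefix does not match
```

A server that merges the repeated slashes will still serve the second
path, so the page is reachable even though the rule does not match it.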
I thought that parsing and processing a robots.txt file was the
responsibility of the bot, not the Web server. All the Web
server has to do is deliver the robots.txt file to the bot.
If that is true, the problem lies with Google and not Apache.
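
If the bot is the one that has to cope, one possible approach is for it to
collapse repeated slashes in the request path before comparing against the
Disallow rules. The sketch below assumes that policy; the helper names
normalized_path and is_disallowed are hypothetical, and nothing here says
that any real crawler works this way. Strictly speaking, paths that differ
only in repeated slashes are different URLs, so this is a choice made to
match the Apache behaviour described above, not something a standard
requires.

```python
import re
from urllib.parse import urlsplit


def normalized_path(url):
    """Return the URL's path with runs of '/' collapsed to a single '/'."""
    path = urlsplit(url).path or "/"
    return re.sub(r"/{2,}", "/", path)


def is_disallowed(url, disallow_rules):
    # Compare the normalized path against each non-empty Disallow prefix.
    path = normalized_path(url)
    return any(rule and path.startswith(rule) for rule in disallow_rules)


rules = ["/foundation/"]  # the rule from the quoted message

print(is_disallowed("http://apache.org/foundation/", rules))       # True
print(is_disallowed("http://apache.org//////foundation/", rules))  # True
```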
--
David E. Ross
<URL:http://www.rossde.com/>
I use Mozilla as my Web browser because I want a browser that
complies with Web standards. See <URL:http://www.mozilla.org/>.