Re: Warning: robots.txt unreliable in Apache servers



Anonymous, quoting Philip Ronan wrote:
>
>
> Subject: Warning: robots.txt unreliable in Apache servers
> From: Philip Ronan <invalid@xxxxxxxxxxxxxxx>
> Newsgroups: alt.internet.search-engines
> Message-ID: <BF89BF33.39FDF%invalid@xxxxxxxxxxxxxxx>
> Date: Sat, 29 Oct 2005 23:07:46 GMT
>
> Hi,
>
> I recently discovered that robots.txt files aren't necessarily any use on
> Apache servers.
>
> For some reason, the Apache developers decided to treat multiple consecutive
> forward slashes in a request URI as a single forward slash. So for example,
> <http://apache.org/foundation/> and <http://apache.org//////foundation/>
> both resolve to the same page.
>
> Let's suppose the Apache website owners want to stop search engine robots
> crawling through their "foundation" pages. They could put this rule in their
> robots.txt file:
>
> Disallow: /foundation/
>
> But if I posted a link to //////foundation/ somewhere, the search engines
> will be quite happy to index it because it isn't covered by this rule.
>
> As a result of all this, Google is currently indexing a page on my website
> that I specifically asked it to stay away from :-(
>
> You might want to check the behaviour of your servers to see if you're
> vulnerable to the same sort of problem.
>
> If anyone's interested, I've put together a .htaccess rule and a PHP script
> that seem to sort things out.

As I understand it, parsing and processing a robots.txt file is the
responsibility of the bot, not the Web server. All the Web server
has to do is deliver the robots.txt file to the bot.

If that is true, the problem lies with Google and not with Apache.
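
To make the point concrete, here is a minimal sketch of the prefix matching
a bot performs on its side. It is not Google's or Apache's actual code; the
function names (is_disallowed, collapse_slashes) are hypothetical, and real
crawlers implement more of the robots.txt rules than this. It only shows
why a literal prefix match misses the multi-slash URL unless the bot
normalizes the path first.

```python
import re


def is_disallowed(path, disallow_rules):
    """Simplified robots.txt check (assumption: plain prefix matching only).

    A path is blocked if it starts with any non-empty Disallow prefix.
    An empty Disallow value blocks nothing, so it is skipped.
    """
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


def collapse_slashes(path):
    """Collapse runs of '/' into a single '/'.

    This mimics the server behaviour described in the quoted post, where
    //////foundation/ and /foundation/ resolve to the same page.
    """
    return re.sub(r"/{2,}", "/", path)


rules = ["/foundation/"]

# The rule as written covers the normal path.
print(is_disallowed("/foundation/", rules))

# A literal prefix match does not cover the multi-slash variant.
print(is_disallowed("//////foundation/", rules))

# If the bot collapses slashes before matching, the rule applies again.
print(is_disallowed(collapse_slashes("//////foundation/"), rules))
```

Whether a bot should normalize the path this way is exactly the question
under discussion; the sketch only shows that the decision is made in the
bot's matching code, not in the file the server hands over.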

--

David E. Ross
<URL:http://www.rossde.com/>

I use Mozilla as my Web browser because I want a browser that
complies with Web standards. See <URL:http://www.mozilla.org/>.