Re: Random Failed Connection on Port 25 from CSS Keep-Alive
- From: Kees Theunissen <theuniss@xxxxxxxx>
- Date: Thu, 13 Mar 2008 02:42:54 +0100
sunckell wrote:
Got a Strange one here.
We are running Sendmail 8.13.8+Sun/8.13.8 on a fully patched
Solaris 10 server. Some business users wanted notifications if
sendmail were to stop responding. So we added a keep-alive to our
CSS. All it does is connect to the port, issue a HELO, and look for a
250 in return. This runs every 5 secs. If it can't connect to the
port or does not receive a 250 response, it marks it down and
notifies.
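For anyone who wants to reproduce the check from a second host, the probe described above can be sketched roughly as follows. This is only an approximation of what the CSS keep-alive does, not its actual implementation; the function names, the HELO name "probe.example" and the 5-second timeout are assumptions made for illustration.

```python
import socket

def reply_code(line: bytes) -> int:
    """Return the 3-digit SMTP reply code at the start of a line, or -1."""
    head = line[:3]
    return int(head) if len(head) == 3 and head.isdigit() else -1

def read_reply(f) -> int:
    """Read one (possibly multi-line) SMTP reply and return its code."""
    while True:
        line = f.readline()
        if not line:
            return -1  # connection closed before a complete reply
        # "250-text" means more lines follow; "250 text" ends the reply.
        if line[3:4] != b"-":
            return reply_code(line)

def probe(host, port=25, helo_name="probe.example", timeout=5.0) -> bool:
    """Connect, wait for the 220 greeting, send HELO, expect a 250."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            f = s.makefile("rb")
            if read_reply(f) != 220:
                return False
            s.sendall(b"HELO " + helo_name.encode("ascii") + b"\r\n")
            ok = read_reply(f) == 250
            s.sendall(b"QUIT\r\n")
            return ok
    except OSError:
        # Refused connections, timeouts and resets all count as "down".
        return False
```

Running probe() in a loop from a machine on a different switch, and logging a timestamp for each failure, would show whether the failures depend on the path to the server rather than on sendmail itself.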
We added this last Friday, and since then we are seeing several
"failures". The strange part is, the service is still running, there
is nothing odd in the logs, except you don't see the CSS connects
during the time period the CSS sees the port as down (or any other
traffic for that matter.)
We have seen time gaps in the logs up to 40 secs.
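Gaps like these can be listed mechanically instead of by eye. The sketch below assumes the usual syslog line layout, where the first 15 characters are a timestamp such as "Mar 12 10:00:00"; the function names, the 10-second threshold and the need to pass the year separately (syslog stamps do not carry one) are assumptions for illustration.

```python
from datetime import datetime, timedelta

def parse_stamp(line, year):
    """Parse the leading syslog timestamp; the year is supplied by the caller."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def find_gaps(lines, year, threshold=timedelta(seconds=10)):
    """Return (previous, current, delta) for consecutive lines further apart
    than the threshold. Lines without a parsable timestamp are skipped."""
    gaps = []
    prev = None
    for line in lines:
        try:
            now = parse_stamp(line, year)
        except ValueError:
            continue
        if prev is not None and now - prev > threshold:
            gaps.append((prev, now, now - prev))
        prev = now
    return gaps
```

With a probe every 5 seconds, any gap well above that interval is a candidate to compare against the times the CSS marked the service down.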
We have snooped the traffic and everything appears to be correct
from a SYN/ACK/FIN standpoint.
We have tested this on other Solaris 10 servers on the network and
seen this issue occur on other servers.
We tested a Solaris 8 server running Sendmail 8.11.7p1+Sun/8.11.7 and
have not seen the issues. (The Solaris 8 servers running Sendmail
8.11.7p1+Sun/8.11.7 are on the same network segments where we see
the failures.)
There is no high memory usage, high CPU usage, or high disk I/O
occurring at the time of the failures. For some reason the port just
stops listening for random amounts of time.
Would anyone have any other ideas to help troubleshoot this issue? I
am running out of ideas, the next step I guess is to open a case with
Sun, but I was hoping to avoid that.
I am hoping someone else out there has experience in troubleshooting
this sort of issue and can give me some suggestions.
I've seen this or something similar once, last year, in a badly
configured network -my own network unfortunately- :-( .
It's off topic on this newsgroup, but nevertheless I'll give a short
description.
A few months ago we renewed our complete network infrastructure. Our
old network was "organically grown" over the years; built up and
maintained by one of our former ICT staff members, who left our
institute some time ago.
The core of the network was our gateway; a Foundry BigIron 4000.
Connected to the BI 4000 was a group of managed switches. I'll call
this the level one switches. All our servers were connected to these
level one switches. This group contained a broad range of brands
(Cisco, D-Link, Dell), models and ages.
Between the first level switches and the users' workstations we had
a second level of cheap unmanaged switches; again several brands,
models, ages.
The network functioned without problems, but our demands were growing
and lots of the switches had passed their end-of-support date.
So we were looking for a new set of switches.
About half a year ago we were testing one of the candidates for the
outer level of switches: a not too expensive managed Dell switch.
During this test we connected a few of our ict-staff members'
workstations through this switch to one of the level one switches.
Most of the settings of the switch under test were factory defaults.
In that configuration we noticed network delays like you described
for some of the traffic. To be precise: every few minutes the switch
under test blocked all traffic from one of our servers to the
connected workstations (just the traffic from _one_ server and only
in _that_ direction iirc). Gaps ranged from a few to nearly fifty
seconds. Sniffing the traffic we saw a coincidence between the gaps
and STP packets transmitted by the switch. So the switch apparently
tried to detect the network configuration. The server affected
happened to be our only multi-homed system - connected to several
VLANs (but with routing / packet forwarding between the VLANs
disabled). STP was not (actively) in use on the network but was
not switched off on all of the switches. :-(
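A coincidence like this can also be checked from capture timestamps rather than by eye. The sketch below is hypothetical and is not what we ran at the time: it assumes the gap intervals and the STP packet times have already been extracted as seconds since the start of the capture, and the 2-second window is an arbitrary choice.

```python
from bisect import bisect_left

def gaps_near_events(gaps, events, window=2.0):
    """Return the gaps whose start lies within `window` seconds of any event.

    gaps   -- list of (start, end) times in seconds
    events -- list of event times in seconds (e.g. STP packets seen)
    """
    events = sorted(events)
    hits = []
    for start, end in gaps:
        # First event at or after (start - window); it is a hit if it is
        # also no later than (start + window).
        i = bisect_left(events, start - window)
        if i < len(events) and events[i] <= start + window:
            hits.append((start, end))
    return hits
```

If most gaps turn out to be hits while STP packets are otherwise rare, that supports the suspicion; it does not prove the cause.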
We didn't do any further debugging - this had taken way too much time
already. But we had strong indications that this was related to the
use of STP on a network that was not properly configured for that
protocol.
We "solved" the problem by simply disconnecting the switch under test.
We had done enough testing by then, and we had already decided _not_
to buy that switch. This incident was only an extra argument for
that decision.
We went for a single manufacturer who could deliver all the network
equipment we needed, for a single support address in case of weird
problems, for similar features and supported protocols in all
switches and a similar user interface, for a single program to
configure and monitor the whole network.
Our old "organically grown" network was not optimally manageable, to
put it politely, because of the wild mix of components. And that
was probably the main cause of this particular weird error.
HTH somehow.
Regards,
Kees.
--
Kees Theunissen.