Re: Random Failed Connection on Port 25 from CSS Keep-Alive



sunckell wrote:
Got a strange one here.

We are running Sendmail 8.13.8+Sun/8.13.8 on a fully patched
Solaris 10 server. Some business users wanted notifications if
sendmail were to stop responding, so we added a keep-alive to our
CSS. All it does is connect to the port, issue a HELO, and look for a
250 in return. This runs every 5 secs. If it can't connect to the
port or does not receive a 250 response, it marks the service down and
notifies.
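
A keep-alive of the kind described above can be approximated with a few
lines of Python, which makes it possible to repeat the check from other
hosts (or from the mail server itself) and compare results. This is only a
sketch of the described behaviour, not the CSS implementation; the function
names, the HELO name and the timeout are assumptions made for illustration.

```python
import socket
import time


def read_reply(sock_file):
    """Read one (possibly multi-line) SMTP reply and return its numeric code."""
    while True:
        line = sock_file.readline()
        if not line:
            raise ConnectionError("connection closed before a full reply arrived")
        # Continuation lines look like "250-text"; the last line is "250 text".
        if line[3:4] != b"-":
            return int(line[:3])


def probe_smtp(host, port=25, timeout=5.0, helo_name="probe.example.invalid"):
    """Connect, wait for the greeting, send HELO, and expect a 250 reply.

    Returns (ok, elapsed_seconds, detail). helo_name is a placeholder.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            reader = sock.makefile("rb")
            greeting = read_reply(reader)
            if greeting != 220:
                return False, time.monotonic() - start, f"greeting {greeting}"
            sock.sendall(b"HELO " + helo_name.encode("ascii") + b"\r\n")
            code = read_reply(reader)
            sock.sendall(b"QUIT\r\n")
            return code == 250, time.monotonic() - start, f"HELO reply {code}"
    except (OSError, ValueError) as exc:
        # Covers refused connections, timeouts and unparsable replies.
        return False, time.monotonic() - start, repr(exc)
```

Running the same probe against localhost on the mail server and from a host
behind a different switch should help to tell a stalled sendmail apart from
traffic that never arrives.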

We added this last Friday, and since then we are seeing several
"failures". The strange part is, the service is still running, there
is nothing odd in the logs, except you don't see the CSS connects
during the time period the CSS sees the port as down (or any other
traffic for that matter.)

We have seen time gaps in the logs up to 40 secs.
We have snooped the traffic and everything appears to be correct
from an ACK/SYN/FIN standpoint.
We have tested this on other Solaris 10 servers on the network and
seen this issue occur on other servers.
We tested a Solaris 8 server running Sendmail 8.11.7p1+Sun/8.11.7 and
have not seen the issue. (The Solaris 8 servers running Sendmail
8.11.7p1+Sun/8.11.7 are on the same network segments where we see the
failures.)
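
The time gaps in the logs mentioned above can be listed mechanically, which
makes them easier to line up with a packet capture. The sketch below
assumes classic syslog lines that start with a "Mmm dd HH:MM:SS" stamp;
because such stamps carry no year, the caller has to supply one. The
function name and the threshold are illustrative choices.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year, threshold=timedelta(seconds=10)):
    """Return (previous, current, delta) for consecutive stamps >= threshold apart."""
    gaps = []
    prev = None
    for line in lines:
        # Normalise the space-padded day of month ("Jan  2" becomes "Jan 2").
        stamp = " ".join(line[:15].split())
        try:
            ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a syslog-style line, skip it
        if prev is not None and ts - prev >= threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps
```

With a probe every 5 secs, any gap well above that interval in the
sendmail log entries marks a window in which no connection reached the
daemon.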

There is no high memory usage, high CPU usage, or high disk I/O
occurring at the time of the failures. For some reason the port
just stops listening for random amounts of time.

Would anyone have any other ideas to help troubleshoot this issue? I
am running out of ideas; the next step, I guess, is to open a case with
Sun, but I was hoping to avoid that.

I am hoping someone else out there has experience in troubleshooting
this sort of issue and can give me some suggestions.

I've seen this or something similar once, last year, in a badly
configured network -my own network unfortunately- :-( .

It's off topic on this newsgroup, but nevertheless I'll give a short
description.

A few months ago we renewed our complete network infrastructure. Our
old network had "organically grown" over the years; it was built up and
maintained by one of our former ICT staff members, who left our
institute some time ago.

The core of the network was our gateway; a Foundry BigIron 4000.
Connected to the BI 4000 was a group of managed switches. I'll call
these the level one switches. All our servers were connected to these
level one switches. This group contained a broad range of brands
(Cisco, D-Link, Dell), models and ages.
Between the first level switches and the users' workstations we had
a second level of cheap unmanaged switches; again several brands,
models, ages.
The network functioned without problems, but our demands were growing
and lots of the switches were past their end-of-support date.
So we were looking for a new set of switches.

About half a year ago we were testing one of the candidates for the
outer level of switches: a not too expensive managed Dell switch.
During this test we connected a few of our ict-staff members'
workstations through this switch to one of the level one switches.
Most of the settings of the switch under test were factory defaults.
In that configuration we noticed network delays like you described
for some of the traffic. To be precise: every few minutes the switch
under test blocked all traffic from one of our servers to the
connected workstations (just the traffic from _one_ server and only
in _that_ direction iirc). Gaps ranged from a few to nearly fifty
seconds. Sniffing the traffic we saw a correlation between the gaps
and STP packets transmitted by the switch. So the switch apparently
tried to detect the network topology. The server affected
happened to be our only multi-homed system - connected to several
VLANs (but with routing / packet forwarding between the VLANs
disabled). STP was not (actively) in use on the network but was
not switched off on all of the switches. :-(
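
As a side note, gaps of a few to nearly fifty seconds are consistent with
the default timers of classic IEEE 802.1D spanning tree. The small
calculation below only illustrates that arithmetic; whether the switch
under test actually ran with these defaults is an assumption that was not
verified.

```python
# IEEE 802.1D default timer values in seconds (assumed, not read from the switch).
HELLO_TIME = 2
MAX_AGE = 20
FORWARD_DELAY = 15

# A port that starts forwarding passes through listening and learning,
# each lasting one forward delay.
listening_plus_learning = 2 * FORWARD_DELAY

# If stale topology information has to expire first, max age is added.
worst_case_blocked = MAX_AGE + listening_plus_learning

print(listening_plus_learning, worst_case_blocked)  # 30 50
```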

We didn't do any further debugging - this had taken way too much time already.
But we had strong indications that this was related to the use of the
STP protocol on a network that was not properly configured for that
protocol.

We "solved" the problem by simply disconnecting the switch under test.
We had done enough testing by then, and we had already decided _not_
to buy that switch. This incident was only an extra argument for
that decision.

We went for a single manufacturer who could deliver all the network
equipment we needed, for a single support address in case of weird
problems, for similar features and supported protocols in all
switches and a similar user interface, for a single program to
configure and monitor the whole network.

Our old "organically grown" network was not optimally manageable -to
say it politely- because of the wild mix of components. And that
probably was the main cause of this particular weird error.
HTH somehow.

Regards,

Kees.

--
Kees Theunissen.