Re: Troubleshooting SANs



Sean,

I'm going to guess that this wasn't an FC problem. I'm more inclined to believe it was a SCSI problem. Specifically, I would guess that the blade you shut down was doing Target Resets.

If an initiator sends a target reset to a target, and that target is providing LUNs for multiple initiators, the outstanding IOs for all of those initiators get reset. The initiators time out and retry the IO, which succeeds. The end result is that all the initiators slow down but no errors are displayed. Zoning won't help.

You can limit the possible suspects by seeing which initiators are slowing down and which target they have in common. The HP box might provide some higher debug level that exposes target resets so you can track them down.
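The narrowing-down can be done on paper, but here is a minimal sketch of the idea, assuming you can write down which targets each initiator has LUNs on. The host and target names are made up for illustration.

```python
def common_targets(paths, slow_initiators):
    """Return the targets shared by every slow initiator."""
    target_sets = [set(paths[i]) for i in slow_initiators]
    if not target_sets:
        return set()
    return set.intersection(*target_sets)


def suspects(paths, targets):
    """Return every initiator that has LUNs on at least one of the given targets."""
    return sorted(i for i, t in paths.items() if targets & set(t))


# Hypothetical example data: initiator -> targets it has LUNs on.
paths = {
    "blade1": ["msa_a"],
    "blade2": ["msa_a", "msa_b"],
    "blade3": ["msa_b"],
    "blade4": ["msa_a"],
}
slow = ["blade1", "blade2"]

shared = common_targets(paths, slow)
print("shared targets:", sorted(shared))
print("initiators that could be sending resets:", suspects(paths, shared))
```

Any initiator on the shared target is a suspect, including ones that are not slow themselves, since the one sending the resets may not notice a problem.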

From my experience, the most likely culprit is a Windows 2003 SP1 cluster node (probably with an older storport driver). I suggest that whenever you see this problem, you just upgrade all the Windows clusters and all the storport drivers.

Follow http://support.microsoft.com/default.aspx?scid=kb;EN-US;923830

MSCS uses resets to decide quorum ownership, and when the nodes get in a pickle, they do too many resets. Too many resets show up as slow storage. Cluster nodes do log resets in the cluster log, although they don't call them resets; look for /arbitrat/, as in arbitration, or something like that.
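If the cluster log is large, a few lines of script can count the matches for you. This is only a sketch: it assumes the log is a plain text file you have copied off the node, it makes no assumption about the line format, and the file name in the usage note is a placeholder.

```python
import re

# Case-insensitive match on "arbitrat", which covers arbitrate, arbitration, etc.
ARBITRATION = re.compile(r"arbitrat", re.IGNORECASE)


def count_arbitration_lines(lines):
    """Count the lines that mention arbitration."""
    return sum(1 for line in lines if ARBITRATION.search(line))


def count_in_file(path):
    """Count arbitration lines in a text file, tolerating odd characters."""
    with open(path, errors="replace") as f:
        return count_arbitration_lines(f)
```

Running count_in_file("cluster.log") on each node and comparing the numbers should show which node is arbitrating far more than the others.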

There is also the Emulex TPRLO (Third Party Process Logout) command, which is an FC issue. You can research TPRLOs. If the offending blade had Emulex cards, see if TPRLO was enabled. (By default it shouldn't be, and if it is you'll get the same problems.)







seanh012@xxxxxxxxx wrote:
I work for a consulting firm, and have begun to do troubleshooting on
small SANs, mostly HP MSA1500cs based.

Many times the problem the customer is talking about is some vague
intermittent slowness issue or something like that. In cases like
this, my troubleshooting goes something like this:

1. Check switch logs for marginal ports or other errors (usually
brocade 4/24s or similar)
2. Update to latest firmware and driver levels on HBAs, Switch, MSA,
etc.

If the problem still exists, I'll call HP support, but more often than
not they can't really help from here. So the only approach that
yields results is to start unplugging stuff until I see the problem
disappear.

In one recent instance, I had a customer start shutting blades off
until he found that one of them had an HBA that was mysteriously
causing the intermittent slowness for the whole SAN. The HBA actually
seemed to work, and there were no errors in the Windows event logs, or
switch logs, sansurfer, or anything.

There has got to be a better way to find this kind of thing. On an IP
network, I would run Ethereal or some other packet analyzer to try and
see what is talking on the network when the problem manifests. But
I've never really found anything like that for a fibre channel SAN.

As I said, I'm pretty new to SAN, so any direction would be helpful.

Thanks,
Sean
