Tons of Sweat, then Sweet Relief



I have a couple of Oracle servers I manage. One of them had a hard
disk failure...

So Thursday last week, I notice an error message in the event logs
on a particular machine. I take a look at the equipment and see a
flashing yellow light on a disk drive. That can't be good.

I set up servers that manage databases with a RAID 1 mirror set for
the operating system and applications. The database and its archive
files live in a RAID 5 array with hot-swappable drives. Performance
is good this way, and I have only ever had one drive fail, and that
was in a RAID 5 array. Pop the old one out, pop a new one in, and it
rebuilds on the fly.

So, I am confronted with the flashing yellow LED, and go to the
manufacturer to get a replacement drive. I order the drive, and it
comes in on Monday. So far so good.

Then I touch it....

For the following, I'll refer you to a troubleshooting flowchart I
have seen (questionable language warning):
<http://ylatis.com/darkon/humor/flochart.html>

So, I touch it. First I back the thing up. Ghost is your friend
even for servers. I back up an image file to the RAID 5 array. I
reboot and install the new drive.

The new drive does not seem to fit in the carrier. It is a
different type than what was in there to start with. Maybe the
connector is just tight; it is a new drive... I apply a wee bit o'
force, and it jams... Damn. I pull it back out. I reinstall the
original to see what happens, and the damned machine won't reboot. Aw
damn!!!

This system has a collector node that collects the data for up to a
couple months. It does a store and forward when the connection is
restored. I am now relying on that...

Do you see the sweat? There are several years of data out there;
backed up though it is, I have severe anxiety about being able to
recover the system to a point that we can recover the data. Some of
it is required by regulatory permits from the EPA.

This is stressful. I remove the bus cage the drives go into, and
inspect it for physical damage. I see none. No markings or bent
anythings. I put the machine back together, and.... It still will
not reboot. There is an invalid drive for booting. Damn!!!

There are times in life when one wishes they had not started on a
certain course, and this was one of them. Just to be sure I approach
things calmly, I decide to walk away for a while, and attend to other
tasks to allow a calmness to happen, rather than deal with panic.
DON'T PANIC!

I go out into the field and complete a client install that failed
Thursday because the IT people screwed up an install image. I get the
mail because the powers that be decided engineers should pick up the
mail rather than get it delivered by the service group, to save a few
dollars <not sure what they are thinking on that>.

OK, I am calm!

I walk up to the system, holding my mouth just right. I go through
the SCSI select utility and the Array Controller configuration
utility, get things where I think I want them, then reboot.

GLORY HALLELUJAH! The damned thing reboots, and I am getting the
stored data to the database. I'm still only operating on one drive
for the mirror, and the NVRAM on the controller is unhappy. There is
an ugly orange light that flashes its unhappiness to the world. That
is not a good sign, but I can live with that.

It is late evening, so I think I'll ponder this a wee bit more and go
home. Let the system catch up. Let me catch up.

Next day!

I go in early. I stop the system and remove the failed drive. I
install the new drive in the carrier, and it goes in as it should. Not
sure why things did not go so smoothly the day before...

I open the array configuration utility and tell it to rebuild the
drive. It takes an hour and a half, but it says it is successful. I
reboot the machine.

It reboots. The ugly flashing orange LED stops, and the LEDs on the
mirror are all green. I check the application, and the thing is
recovering data from the collector node as was intended.

The moral of the story is, at least for me, you can't depend on
anything but pure blind luck! Sh^h^htuff happens, and there is not a
damned thing you can do about it but pray it works out OK in the end.

Backups are running again tonight. Life is good, and nobody
knows...
--

Cheers! :)