Re: CPU Node board failure on Origin 2000
- From: Joerg Behrens <behrens@xxxxxxxxxx>
- Date: Sat, 04 Mar 2006 01:54:24 +0100
Peter van Heusden wrote:
Hi there. Here at SANBI we have an Origin 2000 made up of two machines
in one cabinet. Historically they have appeared as a single 'virtual machine',
Hmm... if the machine is still connected with 2 (max. 4) CrayLink cables, you have a single machine. You can check this by opening the right baffle and looking for the thicker cables.
I think by 'virtual machine' you mean that your system uses the partition feature. With it you can create smaller subunits from a larger installation. The smallest subunit in an Origin 2000 is one module, and each subunit runs its own IRIX kernel (installation).
But I have never used the partition feature, so I can't give more specific hints on how to deal with a problem there. I can't believe that you can run a partitioned system without the MMSC, because the MMSC is needed to shut down and restart a specific partition.
with the 'bottom' one playing the role of 'master'. A few weeks ago,
the machine rebooted itself, and the 'top' machine took over as 'master'.
And now, if you try and boot up the bottom machine - or both machines - you get an error message like this:
When an Origin goes down, the system creates a FRU analysis under /var/adm/crash. You will also find some information in the SYSLOG file. If you have no need for the FRU files, you can delete them to get the disk space back.
.......
DONE
Checking partitioning information ......... DONE
Loading BASEIO prom ....................... DONE
BASEIO PROM Monitor SGI Version 6.111 built 09:43:30 AM May 24, 2002
This looks like an older IRIX installation, because the latest PROM is 6.156 from Nov 18, 2003.
(BE64) 13 CPUs on 7 nodes found.
****************************************************************
* PANIC: Boards in same module show different moduleids. *
* PANIC: Failed to automatically assign moduleid(s) *
* Please assign globally unique module id(s) at the MSC. *
****************************************************************
When Origin modules are CrayLinked together, each one needs a unique ID. In a standard installation the lower module of rack 1 is numbered '1' and the upper one '2'. The lower module of a second rack gets '3', and so on.
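To make the numbering scheme concrete, here is a minimal Python sketch of it. It assumes two modules per rack, as described above; the function name and the 'lower'/'upper' labels are only illustrative and are not anything the MSC or IRIX provides.

```python
def default_module_id(rack: int, position: str) -> int:
    """Return the conventional module ID for a module in a given rack.

    Assumes two modules per rack: rack 1 holds IDs 1 (lower) and 2 (upper),
    rack 2 holds IDs 3 and 4, and so on.
    """
    if rack < 1:
        raise ValueError("rack numbers start at 1")
    offsets = {"lower": 1, "upper": 2}
    if position not in offsets:
        raise ValueError("position must be 'lower' or 'upper'")
    return 2 * (rack - 1) + offsets[position]


if __name__ == "__main__":
    for rack in (1, 2):
        for position in ("lower", "upper"):
            print(rack, position, default_module_id(rack, position))
```

This is only a way to work out which ID each module would normally carry before you assign it by hand.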
If a module has lost its configuration, or after clearing the logs from the POD, or after replacing the MSC, it can happen that two or more modules use the same ID. You then have to assign the IDs manually by entering the command-line mode from the maintenance menu. Type 'help' to get a list of commands. I'm sure there is a command like 'moduleid' or just 'module'. With this command you can read the current ID and also assign a new one.
So shut down both installations and restart only the lower module. If possible, let the system boot into IRIX. The MSC then shows you the current ID. The LEDs show something like 'P0M 1 C', which means "Partition 0, Module 1, Console". Shut down the system and repeat the same step with the upper module. If both use the same ID, you have to reassign one of them. Restart the module and enter the maintenance menu. Press '5' for the command-line menu. Type 'moduleid 1', for example, followed by 'update' to save the new configuration. After this, restart the systems.
Something similar can happen when moving or replacing node boards from one module into another. In my case (a 2-rack system with 32 CPUs) I tried clearallogs and initlogs from the POD. After this the system started to renumber all nodes and modules. But don't try this until you have checked how to set up your partitions!
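If you note down what each MSC display shows, the comparison can be sketched in a few lines of Python. This is only an illustration: the display format is guessed from the single 'P0M 1 C' example above, and the function names and the 'lower'/'upper' labels are hypothetical.

```python
import re
from collections import defaultdict

# Assumed display layout, based only on the example 'P0M 1 C':
# 'P' + partition number, 'M' + module number, optional trailing letter.
_DISPLAY = re.compile(r"P\s*(\d+)\s*M\s*(\d+)\s*([A-Z]?)")


def parse_msc_display(text):
    """Split a display string into (partition, module_id, flag)."""
    match = _DISPLAY.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised display text: {text!r}")
    return int(match.group(1)), int(match.group(2)), match.group(3)


def find_duplicate_module_ids(readings):
    """Map each module ID seen on more than one module to those modules.

    readings: dict of a label you choose (e.g. 'lower') to the display text
    you wrote down while only that module was powered on.
    """
    modules_by_id = defaultdict(list)
    for label, text in readings.items():
        _, module_id, _ = parse_msc_display(text)
        modules_by_id[module_id].append(label)
    return {mid: labels for mid, labels in modules_by_id.items() if len(labels) > 1}


if __name__ == "__main__":
    notes = {"lower": "P0M 1 C", "upper": "P0M 1"}
    print(find_duplicate_module_ids(notes))
```

Any module ID that shows up in the result is used by more than one module, so one of those modules needs a new ID.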
Take a look at
http://forums.nekochan.net/viewtopic.php?t=883&view=next
regards
Joerg