[Mauiusers] Maui Loses Node Features Occasionally

Adam Emerich aemerich at us.ibm.com
Tue Dec 11 13:43:01 MST 2007


We recently expanded our Torque/Maui cluster with 32 more compute nodes.
These nodes are separated out into their own group (cu03), but after
running for a couple of weeks, Maui no longer recognizes some of the
properties listed in the Torque server_priv/nodes file.  Here is an example:

      Qsub Command:  qsub -I -lnodes=1:cu02:ppn=4

      Maui Log:
      12/11 14:32:12 MReqCheckResourceMatch(3494,0,p03-001-0,NULL)
      12/11 14:32:12 INFO:     inadequate features ([cu02] needed [compute][opteron][dev][r7p02] found)
      12/11 14:32:12 MReqCheckResourceMatch(3494,0,p03-002-0,NULL)
      12/11 14:32:12 INFO:     inadequate features ([cu02] needed [compute][opteron][dev][r7p02] found)
      12/11 14:32:12 MReqCheckResourceMatch(3494,0,p03-003-0,NULL)
      12/11 14:32:12 INFO:     inadequate features ([cu02] needed [compute][opteron][dev][r7p02] found)
      12/11 14:32:12 MReqCheckResourceMatch(3494,0,p03-004-0,NULL)
      12/11 14:32:12 INFO:     inadequate features ([cu02] needed [compute][opteron][dev][r7p02] found)
      12/11 14:32:12 MReqCheckResourceMatch(3494,0,p03-005-0,NULL)
      12/11 14:32:12 INFO:     inadequate features ([cu02] needed [compute][opteron][dev][r7p02] found)
      12/11 14:32:12 MReqCheckResourceMatch(3494,0,p03-006-0,NULL)
      12/11 14:32:12 INFO:     inadequate features ([cu02] needed [compute][opteron][dev][r7p02] found)
      12/11 14:32:12 MReqCheckResourceMatch(3494,0,p03-007-0,NULL)
      12/11 14:32:12 INFO:     inadequate features ([cu02] needed [compute][opteron][dev][r7p02] found)

      Torque nodes file:
      p03-001-0 np=4 compute cu03 bcrack1 r7p02 p03-001 opteron DRV_20071210 dev
      p03-002-0 np=4 compute cu03 bcrack1 r7p02 p03-002 opteron DRV_20071210 dev
      p03-003-0 np=4 compute cu03 bcrack1 r7p02 p03-003 opteron DRV_20071210 dev
      p03-004-0 np=4 compute cu03 bcrack1 r7p02 p03-004 opteron DRV_20071210 dev
      p03-005-0 np=4 compute cu03 bcrack1 r7p02 p03-005 opteron DRV_20071210 dev
      p03-006-0 np=4 compute cu03 bcrack1 r7p02 p03-006 opteron DRV_20071210 dev
      p03-007-0 np=4 compute cu03 bcrack1 r7p02 p03-007 opteron DRV_20071210 dev

  As you can see, the feature list Maui found when looking for nodes for
the job does not contain all the features listed in the nodes file.  If the
job is directed to cu03, nodes are selected from the whole pool rather than
only from the 32 new nodes in cu03.  We have tried rebooting and restarting
Maui, but neither resolved the issue.  I have found that if I remove the
cu03 entries, restart Maui, then restore the cu03 entries and restart Maui
again, it seems to work for a while.
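
  To see which properties go missing on each node, the comparison above
can be scripted.  This is only a minimal sketch, not anything shipped with
Torque or Maui.  It assumes the nodes file and the Maui log are read from
disk with one entry per line (not wrapped as in this mail), that the third
argument of MReqCheckResourceMatch is the node name, and that the bracketed
list before "found" is the feature set Maui holds for that node.  The
function names are made up for illustration.

```python
import re

# Assumption: third argument of MReqCheckResourceMatch is the node name.
NODE_RE = re.compile(r"MReqCheckResourceMatch\([^,]*,[^,]*,([^,]+),")
# Assumption: the bracketed list between "needed" and "found" is what Maui sees.
FOUND_RE = re.compile(r"needed\s+((?:\[[^\]]*\])*)\s*found")


def parse_nodes_file(text):
    """Map node name -> set of properties from Torque nodes-file text."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        # key=value fields such as np=4 are attributes, not properties.
        nodes[fields[0]] = {f for f in fields[1:] if "=" not in f}
    return nodes


def parse_maui_log(text):
    """Map node name -> set of features reported as found in the log."""
    seen = {}
    node = None
    for line in text.splitlines():
        match = NODE_RE.search(line)
        if match:
            node = match.group(1)
            continue
        match = FOUND_RE.search(line)
        if match and node is not None:
            seen[node] = set(re.findall(r"\[([^\]]*)\]", match.group(1)))
            node = None
    return seen


def missing_features(nodes_text, log_text):
    """Return {node: sorted properties in the nodes file but not in the log}."""
    configured = parse_nodes_file(nodes_text)
    reported = parse_maui_log(log_text)
    result = {}
    for node, feats in reported.items():
        lost = configured.get(node, set()) - feats
        if lost:
            result[node] = sorted(lost)
    return result
```

  Feeding it the nodes file text and the log text would list, per node, the
properties present in the nodes file but absent from what Maui reports.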

      Code Levels:
      torque-2.1.8.tar.gz
      maui-3.2.6p19.tar.gz

Any help would be appreciated.

Adam Emerich


