Kenneth Young ykyoung at clustertech.com
Tue Mar 15 04:08:08 MST 2005


I have encountered problems with the hostlist disappearing or being
corrupted when submitting a job with
 > qsub -h -lnodes=2:ppn=2 -W x=HOSTLIST:node_003,node_003,node_004,node_004 xhpl.sh

PROBLEM 1: the hostlist got changed
checkjob shows the hostlist as node_001 and node_002 instead of the
node_003 and node_004 I requested. The job subsequently ran and failed
because these nodes do not have the hardware I requested.

PROBLEM 2: the entire hostlist is gone
The -h flag holds the job immediately after submission.  I then run
checkjob immediately, without otherwise altering the job.  Sometimes
(although rarely) the output shows that the job has lost the following:
 - the Hostlist itself is missing
 - Flags:        RESTARTABLE (lost the HOSTLIST flag)

This happens roughly once every 50-100 submissions.  I have not been
able to isolate a factor that deterministically triggers the error.
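
Since the failure is intermittent, a loop along the following lines could
be used to catch it.  This is only a rough sketch, not something I have
verified against every setup: the function names has_hostlist_flag and
repro_loop are made up for illustration, and it assumes that checkjob
accepts the numeric part of the id printed by qsub and that a healthy job
lists HOSTLIST on its "Flags:" line.

```shell
#!/bin/sh
# Sketch only. has_hostlist_flag and repro_loop are hypothetical helpers.

# Read checkjob output on stdin; succeed only if the Flags: line
# mentions HOSTLIST.
has_hostlist_flag() {
    grep -q '^Flags:.*HOSTLIST'
}

# Submit the held job up to N times (default 100) and keep the checkjob
# output of any job whose Flags: line no longer mentions HOSTLIST.
repro_loop() {
    count=${1:-100}
    i=0
    while [ "$i" -lt "$count" ]; do
        i=$((i + 1))
        jobid=$(qsub -h -lnodes=2:ppn=2 -W \
            x=HOSTLIST:node_003,node_003,node_004,node_004 xhpl.sh) || continue
        # Assumption: qsub prints "<number>.<server>" and checkjob wants
        # the number only.
        jobid=${jobid%%.*}
        out=$(checkjob "$jobid" 2>&1)
        case $out in
            *Flags:*) ;;
            *)  # Assumption: the scheduler may not know the job yet.
                echo "attempt $i: job $jobid has no Flags line, skipping"
                qdel "$jobid"
                continue ;;
        esac
        if printf '%s\n' "$out" | has_hostlist_flag; then
            qdel "$jobid"    # healthy job, clean it up
        else
            echo "attempt $i: job $jobid lost its HOSTLIST flag"
            printf '%s\n' "$out" > "checkjob.$jobid.log"
        fi
    done
}
```

Calling "repro_loop 200" after sourcing the file would leave the affected
jobs held in the queue, with their checkjob output saved for inspection.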

I am running torque-1.1.0 and maui-3.2.6.  The verbose output of 
checkjob is enclosed towards the end.
Any help is much appreciated.


 > checkjob 1055

checking job 1055

State: Hold
Creds:  user:ykyoung  group:ykyoung  class:verylong  qos:DEFAULT
WallTime: 00:00:00 of   INFINITY
SubmitTime: Tue Mar 15 18:40:48
  (Time Queued  Total: 00:02:00  Eligible: 00:00:00)

Total Tasks: 1

Req[0]  TaskCount: 4  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

PE:  4.00  StartPriority:  2
cannot select job 1055 for partition DEFAULT (non-idle state 'Hold')

