[Mauiusers] job losing HOSTLIST
Kenneth Young
ykyoung at clustertech.com
Tue Mar 15 04:08:08 MST 2005
Hi,
I have encountered problems with the hostlist disappearing or being
corrupted when submitting a job with
> qsub -h -lnodes=2:ppn=2 -W x=HOSTLIST:node_003,node_003,node_004,node_004 xhpl.sh
PROBLEM 1: the hostlist is changed
checkjob shows a hostlist of node_001 and node_002 instead of node_003
and node_004. The job subsequently ran and failed, because these nodes
do not have the hardware I requested.
PROBLEM 2: the entire hostlist is gone
The -h flag holds the job immediately after submission. I then run
checkjob straight away, without otherwise altering the job. Sometimes
(although rarely) the output shows that the job has lost the following:
- the Hostlist itself is missing
- Flags: RESTARTABLE (lost the HOSTLIST flag)
This happens about once in every 50-100 submissions. I have not been
able to isolate a factor that deterministically triggers the error.
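Since the failure is intermittent, the submit-and-check cycle described
above could be scripted so that the checkjob output of a bad job is saved
when it occurs. The following is only a rough sketch: the function names
has_hostlist_flag and reproduce_loss are made up for illustration, and it
assumes that qsub prints an id of the form "1055.server", that checkjob
accepts the numeric part, and that qdel can remove the held job afterwards.

```shell
#!/bin/sh
# Sketch only: these helper names are illustrative, not part of Maui or TORQUE.

# Succeeds (exit 0) if the checkjob output on stdin has a "Flags:" line
# that mentions HOSTLIST; fails otherwise (including when Flags: is absent).
has_hostlist_flag() {
    grep '^[[:space:]]*Flags:' | grep -q 'HOSTLIST'
}

# Submit a held job N times and keep the checkjob output of any job
# whose HOSTLIST flag is missing. Assumes qsub, checkjob and qdel are in PATH.
reproduce_loss() {
    attempts=$1
    i=0
    while [ "$i" -lt "$attempts" ]; do
        id=$(qsub -h -lnodes=2:ppn=2 -W x=HOSTLIST:node_003,node_003,node_004,node_004 xhpl.sh) || return 1
        num=${id%%.*}                      # assumed id format: "1055.server" -> "1055"
        out=$(checkjob "$num")
        if ! printf '%s\n' "$out" | has_hostlist_flag; then
            printf '%s\n' "$out" > "lost-hostlist-$num.txt"
            echo "job $num lost HOSTLIST"
        fi
        qdel "$id"                         # clean up the held job
        i=$((i + 1))
    done
}
```

Running "reproduce_loss 200" after sourcing the file would attempt 200
submissions; nothing is submitted until the function is called.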
I am running torque-1.1.0 and maui-3.2.6. The verbose output of
checkjob is enclosed towards the end.
Any help is much appreciated.
Regards,
Kenneth
====================================================================
> checkjob 1055
checking job 1055
State: Hold
Creds: user:ykyoung group:ykyoung class:verylong qos:DEFAULT
WallTime: 00:00:00 of INFINITY
SubmitTime: Tue Mar 15 18:40:48
(Time Queued Total: 00:02:00 Eligible: 00:00:00)
Total Tasks: 1
Req[0] TaskCount: 4 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
PE: 4.00 StartPriority: 2
cannot select job 1055 for partition DEFAULT (non-idle state 'Hold')