[torqueusers] job defers
akshar bhosale
akshar.bhosale at gmail.com
Thu Nov 1 19:04:41 MDT 2012
Hi,
We have a 256-node cluster running RHEL 5.2, with PBS version 2.5.8 and
Maui version 3.2.6p21.
Sometimes a job submitted by a user goes into the deferred state instead
of starting execution or waiting in the queue. The error message below is
shown by the checkjob command. After we run releasehold <job id>, the job
leaves the deferred state and either starts executing or goes into the
queue. The message says the connection to the MOM timed out, but the node
is very much online.
Error:
##################################################
checking job 8210
State: Idle EState: Deferred
Creds: user:john group:chem account:dadopr class:chemo qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Thu Nov 1 15:15:13
(Time Queued Total: 00:29:00 Eligible: 00:00:02)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: par1
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: RMFailure (cannot start job - RM failure, rc: 15043, msg: 'Execution server rejected request MSG=connection to mom timed out')
Holds: Defer (hold reason: RMFailure)
PE: 1.00 StartPriority: 1
cannot select job 8210 for partition par1 (job hold active)
cannot select job 8210 for partition par2 (job hold active)
#########################################################################
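
To spot affected jobs without reading each checkjob report by hand, the
deferral reason can be pulled out of the output with a short script. The
following is only a minimal sketch in Python, assuming the output has the
same layout as the report above. The names parse_deferral and SAMPLE are
hypothetical and are not part of TORQUE or Maui.

```python
import re

# Excerpt of the checkjob report above, including the mail-wrapped lines.
SAMPLE = """checking job 8210
State: Idle EState: Deferred
job is deferred. Reason: RMFailure (cannot start job - RM failure, rc:
15043, msg: 'Execution server rejected request MSG=connection to mom timed
out')
Holds: Defer (hold reason: RMFailure)
"""

def parse_deferral(checkjob_text):
    """Extract state, expected state, reason, rc and msg from checkjob text.

    Fields that are not found are returned as None.
    """
    # Rejoin wrapped lines so a pattern can match across a line break.
    flat = " ".join(line.strip() for line in checkjob_text.splitlines())
    state = re.search(r"State:\s*(\S+)\s+EState:\s*(\S+)", flat)
    reason = re.search(r"Reason:\s*(\w+)", flat)
    rc = re.search(r"rc:\s*(\d+)", flat)
    msg = re.search(r"msg:\s*'([^']*)'", flat)
    return {
        "state": state.group(1) if state else None,
        "estate": state.group(2) if state else None,
        "reason": reason.group(1) if reason else None,
        "rc": int(rc.group(1)) if rc else None,
        "msg": msg.group(1) if msg else None,
    }

info = parse_deferral(SAMPLE)
print(info)
```

In practice the text would come from running checkjob for each job id and
passing the captured output to parse_deferral; jobs whose estate is
Deferred with reason RMFailure are the ones that currently need a
releasehold.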