[torqueusers] job defers

akshar bhosale akshar.bhosale at gmail.com
Thu Nov 1 19:04:41 MDT 2012


Hi,

We have a 256-node cluster running RHEL 5.2, with PBS version 2.5.8 and
Maui version 3.2.6p21.

Sometimes a job submitted by a user goes into the deferred state instead of
starting execution or waiting in the queue. Once we run releasehold <job id>,
the job leaves the deferred state and either starts running or is queued.
The error message below is what checkjob shows for such a job. It says the
connection to the MOM timed out, but the node is very much online.

Error:
##################################################
checking job 8210

State: Idle  EState: Deferred
Creds:  user:john  group:chem  account:dadopr  class:chemo  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Thu Nov  1 15:15:13
  (Time Queued  Total: 00:29:00  Eligible: 00:00:02)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: par1
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc:
15043, msg: 'Execution server rejected request MSG=connection to mom timed
out')
Holds:    Defer  (hold reason:  RMFailure)
PE:  1.00  StartPriority:  1
cannot select job 8210 for partition par1 (job hold active)

cannot select job 8210 for partition par2 (job hold active)
#########################################################################
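
In case it helps anyone scripting around this: below is a minimal Python
sketch, assuming the checkjob output looks like the text above, that pulls
the effective state, defer reason, return code and message out of it. The
function name parse_defer_info and the SAMPLE string are only illustrative
and are not part of Maui or TORQUE; the regular expressions are guesses
based on this one output and may need adjusting for other versions.

```python
import re


def parse_defer_info(checkjob_output):
    """Extract defer details from checkjob text (format assumed from the output above)."""
    # Collapse all whitespace so fields wrapped across lines are rejoined.
    text = " ".join(checkjob_output.split())
    info = {}

    # Effective state, e.g. "EState: Deferred"
    m = re.search(r"EState:\s*(\S+)", text)
    if m:
        info["estate"] = m.group(1)

    # Defer reason, e.g. "Reason: RMFailure"
    m = re.search(r"Reason:\s*(\S+)", text)
    if m:
        info["reason"] = m.group(1)

    # Resource manager return code, e.g. "rc: 15043"
    m = re.search(r"rc:\s*(\d+)", text)
    if m:
        info["rc"] = int(m.group(1))

    # Quoted message from the resource manager
    m = re.search(r"msg:\s*'([^']*)'", text)
    if m:
        info["msg"] = m.group(1)

    return info


# Excerpt of the checkjob output quoted in this post, line wraps included.
SAMPLE = """checking job 8210

State: Idle  EState: Deferred
job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc:
15043, msg: 'Execution server rejected request MSG=connection to mom timed
out')
Holds:    Defer  (hold reason:  RMFailure)
"""

if __name__ == "__main__":
    print(parse_defer_info(SAMPLE))
```

Something like this could be fed the output of checkjob for each idle job to
log which jobs were deferred with an RM failure and when.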