[Mauiusers] Deferred jobs

Philip Peartree P.Peartree at postgrad.manchester.ac.uk
Thu Dec 11 08:48:15 MST 2008


I now have this problem on a different cluster (again running Torque
and Maui).

Checkjob for the job gives:

State: Idle  EState: Deferred
Creds:  user:mcdiypp2  group:nmrc  class:med_12h  qos:DEFAULT
WallTime: 00:00:00 of 6:00:00
SubmitTime: Thu Dec 11 15:24:45
   (Time Queued  Total: 00:19:55  Eligible: 00:00:01)

StartDate: -00:19:53  Thu Dec 11 15:24:47
Total Tasks: 32

Req[0]  TaskCount: 32  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  RMFailure  (cannot start job - RM failure,  
rc: 15041, msg: 'Execution server rejected request MSG=cannot send job  
to mom, state=PRERUN')
Holds:    Defer  (hold reason:  RMFailure)
PE:  32.00  StartPriority:  1
cannot select job 157 for partition DEFAULT (job hold active)
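
When several jobs are deferred at once, it can help to pull the key fields out of output like the above mechanically. The following is a minimal Python sketch: the function name parse_checkjob and the regular expressions are assumptions based only on this one sample of checkjob output, not anything provided by Maui, and other Maui versions may format the text differently.

```python
import re

# Excerpt of the checkjob output quoted above (line wrapping preserved).
SAMPLE = """State: Idle  EState: Deferred
job is deferred.  Reason:  RMFailure  (cannot start job - RM failure,
rc: 15041, msg: 'Execution server rejected request MSG=cannot send job
to mom, state=PRERUN')
Holds:    Defer  (hold reason:  RMFailure)
"""


def parse_checkjob(text):
    """Return a dict with the state, expected state and deferral details.

    Hypothetical helper: the patterns are inferred from a single sample.
    Keys are only present when the matching field is found.
    """
    info = {}
    m = re.search(r"State:\s*(\S+)\s+EState:\s*(\S+)", text)
    if m:
        info["state"], info["estate"] = m.group(1), m.group(2)
    m = re.search(r"Reason:\s*(\w+)", text)
    if m:
        info["reason"] = m.group(1)
    m = re.search(r"rc:\s*(\d+)", text)
    if m:
        info["rc"] = int(m.group(1))
    # The message can wrap across lines, so match across newlines and
    # collapse runs of whitespace afterwards.
    m = re.search(r"msg:\s*'(.*?)'", text, re.DOTALL)
    if m:
        info["msg"] = " ".join(m.group(1).split())
    return info


if __name__ == "__main__":
    for key, value in parse_checkjob(SAMPLE).items():
        print(f"{key}: {value}")
```

The return code and message are the parts worth searching for in the pbs_server and mom logs, since they come from the resource manager rather than from Maui itself.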

Having looked this up on Google, I gather it might be a Torque problem,
but the basic problem (as I see it) seems to be that two jobs are
assigned to the same set of processors/nodes, and I thought that
preventing this was Maui's job. This has happened before and resolved
itself (admittedly while another problem was being sorted out).

I have checked the logs on the affected nodes, and there is nothing to
indicate whether they even received the job at all.
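
One way to make that log check repeatable is a small filter that keeps only the lines mentioning a given job number. This is a rough Python sketch: lines_for_job is a hypothetical helper, it makes no assumption about the Torque log format beyond the job number appearing somewhere in the line, and the sample entries are made-up placeholders rather than real log output.

```python
import re


def lines_for_job(log_lines, job_id):
    """Return the lines that mention job_id as a whole number.

    Hypothetical helper: the lookarounds stop job 157 from also
    matching 1570 or 2157.
    """
    pattern = re.compile(r"(?<!\d)" + re.escape(str(job_id)) + r"(?!\d)")
    return [line.rstrip("\n") for line in log_lines if pattern.search(line)]


if __name__ == "__main__":
    # Placeholder entries for illustration only, not real Torque logs.
    sample = [
        "placeholder entry mentioning job 157 on node24",
        "placeholder entry mentioning job 1570",
        "unrelated placeholder entry",
    ]
    for line in lines_for_job(sample, 157):
        print(line)
```

An empty result for a node's log would support the suspicion that the job never reached that mom. To use it on a real file, pass an open file object as log_lines; the log location depends on the local Torque installation.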


Quoting "Steve Young" <chemadm at hamilton.edu>:

> Hi,
> 	I was looking at the maui manual at:
>
> http://www.clusterresources.com/products/maui/docs/11.1jobholds.shtml
>
> What does checkjob tell you for that job?
>
> -Steve
>
> On Dec 11, 2008, at 9:40 AM, Philip Peartree wrote:
>
>> Does anyone have any ideas?
>>
>> Quoting "Philip Peartree" <P.Peartree at postgrad.manchester.ac.uk>:
>>
>>> Hi,
>>>
>>> I'm having a problem with a torque/maui setup (hence the mail to both
>>> lists). Submitted jobs are being deferred, and this primarily seems to
>>> be because they're all requesting the same resource (node24 at this
>>> point). A qrun seems to shift them onto a correct node.
>>>
>>> My pbs_server log suggests that it's being rejected by the mom, and a
>>> look at the logs on the mom shows a rejection going on with code 15004
>>> and the job in unexpected state TRANSICM
>>>
>>> Can anyone help?
>>>
>>> Phil Peartree
>>> University of Manchester
>>>
>>> _______________________________________________
>>> mauiusers mailing list
>>> mauiusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>>



