[Mauiusers] Deferred jobs

Steve Young chemadm at hamilton.edu
Thu Dec 11 09:40:09 MST 2008


Hi Philip,
	How about running checknode on the node it was trying to run on? Does
it see the node OK? Or possibly pbsnodes -a <nodename> to make sure that
torque is seeing the node properly? I'm just grasping at straws here
=).... If you run releasehold <jobid>, does the job run after that?
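
A minimal sketch of that sequence, assuming checknode, checkjob and
releasehold (maui) and pbsnodes (torque) are on the PATH of the machine you
run it from. The function names are made up for illustration, and the node
name and job id in the usage line are placeholders to replace with your own.
The commands are passed exactly as written above, with no extra flags.

```python
import shutil
import subprocess


def diagnostic_commands(node, jobid):
    """Return the argv lists for the checks suggested above, in order."""
    return [
        ["checknode", node],       # maui's view of the node
        ["pbsnodes", "-a", node],  # torque's view of the node
        ["checkjob", jobid],       # look at the hold before releasing it
    ]


def run_diagnostics(node, jobid, release=False):
    """Run each check and print its output; optionally release the hold."""
    commands = diagnostic_commands(node, jobid)
    if release:
        # Only attempted when explicitly requested by the caller.
        commands.append(["releasehold", jobid])
    for argv in commands:
        if shutil.which(argv[0]) is None:
            print(f"skipping {argv[0]}: not found on PATH")
            continue
        result = subprocess.run(argv, capture_output=True, text=True)
        print(f"$ {' '.join(argv)} (exit {result.returncode})")
        print(result.stdout or result.stderr)
```

Something like run_diagnostics("node24", "157") would print what each tool
reports; pass release=True only once the node looks healthy in both
checknode and pbsnodes.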

-Steve

On Dec 11, 2008, at 10:48 AM, Philip Peartree wrote:

> I now have this problem on a different cluster (but again running
> torque and maui).
>
> Checkjob for the job gives:
>
> State: Idle  EState: Deferred
> Creds:  user:mcdiypp2  group:nmrc  class:med_12h  qos:DEFAULT
> WallTime: 00:00:00 of 6:00:00
> SubmitTime: Thu Dec 11 15:24:45
>  (Time Queued  Total: 00:19:55  Eligible: 00:00:01)
>
> StartDate: -00:19:53  Thu Dec 11 15:24:47
> Total Tasks: 32
>
> Req[0]  TaskCount: 32  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
>
> job is deferred.  Reason:  RMFailure  (cannot start job - RM  
> failure, rc: 15041, msg: 'Execution server rejected request  
> MSG=cannot send job to mom, state=PRERUN')
> Holds:    Defer  (hold reason:  RMFailure)
> PE:  32.00  StartPriority:  1
> cannot select job 157 for partition DEFAULT (job hold active)
>
> Having looked this up on Google, I found suggestions that it might be a
> torque problem, but the basic problem (as I see it) seems to be that two
> jobs are assigned to the same set of processors/nodes, and I thought
> that preventing this was the job of maui. This has happened previously
> and resolved itself (admittedly while another problem was being sorted).
>
> I have checked the logs on the affected nodes and there is nothing
> to say whether they even received the job at all!
>
>
> Quoting "Steve Young" <chemadm at hamilton.edu>:
>
>> Hi,
>> 	I was looking at the maui manual at:
>>
>> http://www.clusterresources.com/products/maui/docs/11.1jobholds.shtml
>>
>> What does checkjob tell you for that job?
>>
>> -Steve
>>
>> On Dec 11, 2008, at 9:40 AM, Philip Peartree wrote:
>>
>>> Does anyone have any ideas?
>>>
>>> Quoting "Philip Peartree" <P.Peartree at postgrad.manchester.ac.uk>:
>>>
>>>> Hi,
>>>>
>>>> I'm having a problem with a torque/maui setup (hence the mail to
>>>> both lists). Submitted jobs are being deferred, and this primarily
>>>> seems to be because they're all requesting the same resource
>>>> (node24 at this point). A qrun seems to shift them onto a correct
>>>> node.
>>>>
>>>> My pbs_server log suggests that the job is being rejected by the
>>>> mom, and a look at the logs on the mom shows a rejection with code
>>>> 15004 and the job in unexpected state TRANSICM.
>>>>
>>>> Can anyone help?
>>>>
>>>> Phil Peartree
>>>> University of Manchester
>>>>
>>>> _______________________________________________
>>>> mauiusers mailing list
>>>> mauiusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>>>
>>>
>>
>


