[torqueusers] Re: [Mauiusers] Deferred jobs

Philip Peartree P.Peartree at postgrad.manchester.ac.uk
Thu Dec 11 11:05:22 MST 2008


It was working yesterday, but when I came to run some more jobs, they  
wouldn't go.

I just tried qrunning again and I get this error:

qrun: Execution server rejected request MSG=cannot send job to mom,  
state=PRERUN 158.steel.mib.man.ac.uk
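
For anyone who wants to pick this message apart in a script, here is a minimal sketch in Python. It is only an illustration: parse_qrun_error and the field names (summary, msg, state, job) are my own labels, not anything TORQUE defines, and the pattern is based solely on the one error shown above.

```python
import re

# Pattern for the qrun rejection message quoted above. The group names are
# labels chosen for this sketch, not TORQUE terminology.
QRUN_ERROR = re.compile(
    r"qrun:\s*(?P<summary>.*?)\s+MSG=(?P<msg>.*?),\s*state=(?P<state>\w+)\s+(?P<job>\S+)"
)


def parse_qrun_error(text):
    """Return the parts of a qrun rejection message as a dict, or None if no match."""
    # Mail clients wrap long lines, so collapse all runs of whitespace first.
    flat = " ".join(text.split())
    match = QRUN_ERROR.search(flat)
    return match.groupdict() if match else None
```

Applied to the two wrapped lines above, this gives state "PRERUN" and job "158.steel.mib.man.ac.uk".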

I've reported this error to the support contact at the manufacturer  
(who did the initial install) so we'll wait and see what comes from  
that!
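
In the meantime, since the message says the server could not send the job to the mom, one thing that can be checked from the pbs_server host is whether a TCP connection to pbs_mom on the node succeeds at all. Below is a rough sketch, not a definitive test: can_connect is a hypothetical helper, "node24" in the comment is just a placeholder host name, and port 15002 is assumed to be the pbs_mom port (the usual TORQUE default, but a site may have changed it).

```python
import socket

# Assumption: pbs_mom listens on the usual TORQUE default port.
# Check the local configuration before relying on this value.
MOM_PORT = 15002


def can_connect(host, port=MOM_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host name and opens the socket;
        # the with-block closes the socket again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example call, run from the pbs_server host (placeholder node name):
#     can_connect("node24")
```

A False result would point at the network, a firewall, or pbs_mom not running on that node. A True result only shows that the port is open, not that the mom will accept the job.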


Quoting "Steve Young" <chemadm at hamilton.edu>:

> I think the cc got mixed up; I'm not sure if it made it to the list.
>
> It could be possible ssh is a problem. However, if it were I'd think it
> would be consistently not working. It sounds like sometimes it does
> work?
>
> -Steve
>
>
> On Dec 11, 2008, at 12:05 PM, Philip Peartree wrote:
>
>> Is it possible it could be an ssh problem, since:
>>
>> a) that is how internode communication is handled
>> b) there seems to be nothing showing up on the pbs_mom logs on the nodes
>> c) the problem I had fixed was to do with that
>>
>> Any ideas guys?
>>
>>
>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>
>>> Hmm... I'm not sure... I was hoping someone else would chime in with some
>>> ideas too =). Let's see if anyone else pipes up.
>>>
>>> -Steve
>>>
>>>
>>>
>>> On Dec 11, 2008, at 11:51 AM, Philip Peartree wrote:
>>>
>>>> Checknode and pbsnodes -a show the node ok, and releasehold tries
>>>> to run the job, but it returns to deferred status.
>>>>
>>>>
>>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>>
>>>>> Hi Philip,
>>>>> 	How about checknode on the node it was trying to run on? Does it see
>>>>> the node ok? Or possibly pbsnodes -a <nodename> to make sure that
>>>>> torque is seeing the node properly?  I'm just grasping at straws here
>>>>> =)... If you run releasehold <jobid>, does the job run after that?
>>>>>
>>>>> -Steve
>>>>>
>>>>> On Dec 11, 2008, at 10:48 AM, Philip Peartree wrote:
>>>>>
>>>>>> I now have this problem on a different cluster (but again running
>>>>>> torque and maui).
>>>>>>
>>>>>> Checkjob for the job gives:
>>>>>>
>>>>>> State: Idle  EState: Deferred
>>>>>> Creds:  user:mcdiypp2  group:nmrc  class:med_12h  qos:DEFAULT
>>>>>> WallTime: 00:00:00 of 6:00:00
>>>>>> SubmitTime: Thu Dec 11 15:24:45
>>>>>> (Time Queued  Total: 00:19:55  Eligible: 00:00:01)
>>>>>>
>>>>>> StartDate: -00:19:53  Thu Dec 11 15:24:47
>>>>>> Total Tasks: 32
>>>>>>
>>>>>> Req[0]  TaskCount: 32  Partition: ALL
>>>>>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>>>>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>>>>>
>>>>>>
>>>>>> IWD: [NONE]  Executable:  [NONE]
>>>>>> Bypass: 0  StartCount: 1
>>>>>> PartitionMask: [ALL]
>>>>>> Flags:       RESTARTABLE
>>>>>>
>>>>>> job is deferred.  Reason:  RMFailure  (cannot start job - RM     
>>>>>> failure, rc: 15041, msg: 'Execution server rejected request     
>>>>>> MSG=cannot send job to mom, state=PRERUN')
>>>>>> Holds:    Defer  (hold reason:  RMFailure)
>>>>>> PE:  32.00  StartPriority:  1
>>>>>> cannot select job 157 for partition DEFAULT (job hold active)
>>>>>>
>>>>>> I looked this up on Google, and it seems it might be a torque
>>>>>> problem, but the basic problem (as I see it) seems to be that two
>>>>>> jobs are assigned to the same set of processors/nodes, and I thought
>>>>>> that node assignment was the job of maui. This has happened
>>>>>> previously, and resolved itself (admittedly while another problem
>>>>>> was being sorted).
>>>>>>
>>>>>> I have checked the logs on the affected nodes and there is nothing
>>>>>> to say whether they even got the job at all!
>>>>>>
>>>>>>
>>>>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>>>>
>>>>>>> Hi,
>>>>>>> 	I was looking at the maui manual at:
>>>>>>>
>>>>>>> http://www.clusterresources.com/products/maui/docs/11.1jobholds.shtml
>>>>>>>
>>>>>>> What does checkjob tell you for that job?
>>>>>>>
>>>>>>> -Steve
>>>>>>>
>>>>>>> On Dec 11, 2008, at 9:40 AM, Philip Peartree wrote:
>>>>>>>
>>>>>>>> Does anyone have any ideas?
>>>>>>>>
>>>>>>>> Quoting "Philip Peartree" <P.Peartree at postgrad.manchester.ac.uk>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm having a problem with a torque/maui setup (hence the mail to both
>>>>>>>>> lists). Submitted jobs are being deferred, and this primarily seems to
>>>>>>>>> be because they're all requesting the same resource (node24 at this
>>>>>>>>> point). A qrun seems to shift them onto a correct node.
>>>>>>>>>
>>>>>>>>> My pbs_server log suggests that it's being rejected by the mom, and a
>>>>>>>>> look at the logs on the mom shows a rejection going on with code 15004
>>>>>>>>> and the job in unexpected state TRANSICM.
>>>>>>>>>
>>>>>>>>> Can anyone help?
>>>>>>>>>
>>>>>>>>> Phil Peartree
>>>>>>>>> University of Manchester
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> mauiusers mailing list
>>>>>>>>> mauiusers at supercluster.org
>>>>>>>>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>>>>>>>>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>



