[torqueusers] Re: [Mauiusers] Deferred jobs

Josh Butikofer josh at clusterresources.com
Thu Dec 11 11:38:03 MST 2008


Philip,

What node is the job trying to run on when it gets this error message?
Also, is the job you are trying to qrun job 158? If not, then I
suspect that job 158 is stuck on that pbs_mom and is blocking new jobs
from starting there.

Run a "momctl -d 3 -h <NODENAME>" on that node and send us the output.
This will show us what that pbs_mom believes its current state is.
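
To make the job-number check concrete, here is a minimal Python sketch. It
assumes the error text has the form quoted below ("state=<STATE>
<jobnum>.<server>"). The function name blocking_job and the job id 159 are
made up for illustration and are not part of TORQUE.

```python
import re

# Matches the tail of the qrun error quoted in this thread, e.g.
#   "... state=PRERUN 158.steel.mib.man.ac.uk"
ERROR_RE = re.compile(r"state=(?P<state>[A-Z]+)\s+(?P<job>\d+)\.(?P<server>\S+)")


def blocking_job(error_text, requested_job_id):
    """Return (job_number, state) if the error names a job other than
    the one that was requested, otherwise None."""
    # The mail client wrapped the message, so collapse whitespace first.
    flat = " ".join(error_text.split())
    match = ERROR_RE.search(flat)
    if match is None:
        return None
    # Accept either "158" or "158.server.name" for the requested job.
    requested = requested_job_id.split(".", 1)[0]
    if match.group("job") == requested:
        return None
    return match.group("job"), match.group("state")


msg = """qrun: Execution server rejected request MSG=cannot send job to mom,
state=PRERUN 158.steel.mib.man.ac.uk"""

# "159" is a hypothetical id for the job passed to qrun.
print(blocking_job(msg, "159"))  # ('158', 'PRERUN')
```

If this returns a job number, that is the job to look for in the momctl
output; if it returns None, the error refers to the job you ran yourself.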

Josh Butikofer
Cluster Resources, Inc.
#############################


Philip Peartree wrote:
> It was working yesterday, but when I came to run some more jobs, they
> wouldn't start.
> 
> I just tried qrunning again and I get this error:
> 
> qrun: Execution server rejected request MSG=cannot send job to mom, 
> state=PRERUN 158.steel.mib.man.ac.uk
> 
> I've reported this error to the support contact at the manufacturer (who 
> did the initial install) so we'll wait and see what comes from that!
> 
> 
> Quoting "Steve Young" <chemadm at hamilton.edu>:
> 
>> I think the cc got mixed up; I'm not sure if this made it to the list.
>>
>> It could be an ssh problem. However, if it were, I'd expect it to
>> fail consistently. It sounds like it sometimes works?
>>
>> -Steve
>>
>>
>> On Dec 11, 2008, at 12:05 PM, Philip Peartree wrote:
>>
>>> Is it possible it could be an ssh problem, since:
>>>
>>> a) that is how internode communication is handled
>>> b) there seems to be nothing showing up on the pbs_mom logs on the nodes
>>> c) the problem I had fixed was to do with that
>>>
>>> Any ideas guys?
>>>
>>>
>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>
>>>> Hmm... I'm not sure... I was hoping someone else would chime in with some
>>>> ideas too =). Let's see if anyone else pipes up.
>>>>
>>>> -Steve
>>>>
>>>>
>>>>
>>>> On Dec 11, 2008, at 11:51 AM, Philip Peartree wrote:
>>>>
>>>>> Checknode and pbsnodes -a show the node ok, and releasehold tries
>>>>> to run the job, but it returns to deferred status.
>>>>>
>>>>>
>>>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>>>
>>>>>> Hi Philip,
>>>>>>     How about checknode on the node it was trying to run on? Does it
>>>>>> see the node ok? Or possibly pbsnodes -a <nodename> to make sure that
>>>>>> torque is seeing the node properly? I'm just grasping at straws here
>>>>>> =).... If you run releasehold <jobid>, does the job run after that?
>>>>>>
>>>>>> -Steve
>>>>>>
>>>>>> On Dec 11, 2008, at 10:48 AM, Philip Peartree wrote:
>>>>>>
>>>>>>> I now have this problem on a different cluster (but again
>>>>>>> running torque and maui).
>>>>>>>
>>>>>>> Checkjob for the job gives:
>>>>>>>
>>>>>>> State: Idle  EState: Deferred
>>>>>>> Creds:  user:mcdiypp2  group:nmrc  class:med_12h  qos:DEFAULT
>>>>>>> WallTime: 00:00:00 of 6:00:00
>>>>>>> SubmitTime: Thu Dec 11 15:24:45
>>>>>>> (Time Queued  Total: 00:19:55  Eligible: 00:00:01)
>>>>>>>
>>>>>>> StartDate: -00:19:53  Thu Dec 11 15:24:47
>>>>>>> Total Tasks: 32
>>>>>>>
>>>>>>> Req[0]  TaskCount: 32  Partition: ALL
>>>>>>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>>>>>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>>>>>>
>>>>>>>
>>>>>>> IWD: [NONE]  Executable:  [NONE]
>>>>>>> Bypass: 0  StartCount: 1
>>>>>>> PartitionMask: [ALL]
>>>>>>> Flags:       RESTARTABLE
>>>>>>>
>>>>>>> job is deferred.  Reason:  RMFailure  (cannot start job - RM
>>>>>>> failure, rc: 15041, msg: 'Execution server rejected request
>>>>>>> MSG=cannot send job to mom, state=PRERUN')
>>>>>>> Holds:    Defer  (hold reason:  RMFailure)
>>>>>>> PE:  32.00  StartPriority:  1
>>>>>>> cannot select job 157 for partition DEFAULT (job hold active)
>>>>>>>
>>>>>>> Having looked this up on Google, it might be a torque problem,
>>>>>>> but the basic problem (as I see it) seems to be that two jobs
>>>>>>> are assigned to the same set of processors/nodes, and I thought
>>>>>>> preventing that was maui's job. This has happened previously and
>>>>>>> resolved itself (admittedly while another problem was being
>>>>>>> sorted).
>>>>>>>
>>>>>>> I have checked the logs on the affected nodes and there is
>>>>>>> nothing to say whether they even received the job at all!
>>>>>>>
>>>>>>>
>>>>>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>     I was looking at the maui manual at:
>>>>>>>>
>>>>>>>> http://www.clusterresources.com/products/maui/docs/11.1jobholds.shtml 
>>>>>>>>
>>>>>>>>
>>>>>>>> What does checkjob tell you for that job?
>>>>>>>>
>>>>>>>> -Steve
>>>>>>>>
>>>>>>>> On Dec 11, 2008, at 9:40 AM, Philip Peartree wrote:
>>>>>>>>
>>>>>>>>> Does anyone have any ideas?
>>>>>>>>>
>>>>>>>>> Quoting "Philip Peartree" <P.Peartree at postgrad.manchester.ac.uk>:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm having a problem with a torque/maui setup (hence the mail to
>>>>>>>>>> both lists). Submitted jobs are being deferred, and this seems to
>>>>>>>>>> be primarily because they're all requesting the same resource
>>>>>>>>>> (node24 at this point). A qrun seems to shift them onto a correct
>>>>>>>>>> node.
>>>>>>>>>>
>>>>>>>>>> My pbs_server log suggests that the job is being rejected by the
>>>>>>>>>> mom, and the logs on the mom show a rejection with code 15004
>>>>>>>>>> and the job in unexpected state TRANSICM.
>>>>>>>>>>
>>>>>>>>>> Can anyone help?
>>>>>>>>>>
>>>>>>>>>> Phil Peartree
>>>>>>>>>> University of Manchester
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> mauiusers mailing list
>>>>>>>>>> mauiusers at supercluster.org
>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
> 
> 

