[torqueusers] Re: [Mauiusers] Deferred jobs

Steve Young chemadm at hamilton.edu
Thu Dec 11 10:43:53 MST 2008


I think the cc got mixed up, so I'm not sure whether this made it to the list.

ssh could be the problem. However, if it were, I'd expect it to fail
consistently. It sounds like it sometimes does work?

-Steve


On Dec 11, 2008, at 12:05 PM, Philip Peartree wrote:

> Could it be an ssh problem? I ask because:
>
> a) that is how internode communication is handled
> b) nothing seems to be showing up in the pbs_mom logs on the nodes
> c) the problem I fixed previously was related to that
>
> Any ideas, guys?
>
>
> Quoting "Steve Young" <chemadm at hamilton.edu>:
>
>> Hmm... I'm not sure... I was hoping someone else would chime in with
>> some ideas too =). Let's see if anyone else pipes up.
>>
>> -Steve
>>
>>
>>
>> On Dec 11, 2008, at 11:51 AM, Philip Peartree wrote:
>>
>>> Both checknode and pbsnodes -a show the node as OK, and releasehold
>>> tries to run the job, but it returns to deferred status.
>>>
>>>
>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>
>>>> Hi Philip,
>>>> 	How about running checknode on the node it was trying to run on? Does
>>>> it see the node OK? Or possibly pbsnodes -a <nodename> to make sure that
>>>> torque is seeing the node properly? I'm just grasping at straws here
>>>> =).... If you run releasehold <jobid>, does the job run after that?
>>>>
>>>> -Steve
>>>>
>>>> On Dec 11, 2008, at 10:48 AM, Philip Peartree wrote:
>>>>
>>>>> I now have this problem on a different cluster (but again running
>>>>> torque and maui).
>>>>>
>>>>> Checkjob for the job gives:
>>>>>
>>>>> State: Idle  EState: Deferred
>>>>> Creds:  user:mcdiypp2  group:nmrc  class:med_12h  qos:DEFAULT
>>>>> WallTime: 00:00:00 of 6:00:00
>>>>> SubmitTime: Thu Dec 11 15:24:45
>>>>> (Time Queued  Total: 00:19:55  Eligible: 00:00:01)
>>>>>
>>>>> StartDate: -00:19:53  Thu Dec 11 15:24:47
>>>>> Total Tasks: 32
>>>>>
>>>>> Req[0]  TaskCount: 32  Partition: ALL
>>>>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>>>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>>>>
>>>>>
>>>>> IWD: [NONE]  Executable:  [NONE]
>>>>> Bypass: 0  StartCount: 1
>>>>> PartitionMask: [ALL]
>>>>> Flags:       RESTARTABLE
>>>>>
>>>>> job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 15041, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN')
>>>>> Holds:    Defer  (hold reason:  RMFailure)
>>>>> PE:  32.00  StartPriority:  1
>>>>> cannot select job 157 for partition DEFAULT (job hold active)
>>>>>
>>>>> From what I found on Google, this might be a torque problem, but the
>>>>> basic problem (as I see it) seems to be that two jobs are assigned to
>>>>> the same set of processors/nodes, and I thought that preventing this
>>>>> was maui's job. This has happened previously and resolved itself
>>>>> (admittedly while another problem was being sorted out).
>>>>>
>>>>> I have checked the logs on the affected nodes, and there is nothing
>>>>> to say whether they even received the job at all!
>>>>>
>>>>>
>>>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>>>
>>>>>> Hi,
>>>>>> 	I was looking at the maui manual at:
>>>>>>
>>>>>> http://www.clusterresources.com/products/maui/docs/11.1jobholds.shtml
>>>>>>
>>>>>> What does checkjob tell you for that job?
>>>>>>
>>>>>> -Steve
>>>>>>
>>>>>> On Dec 11, 2008, at 9:40 AM, Philip Peartree wrote:
>>>>>>
>>>>>>> Does anyone have any ideas?
>>>>>>>
>>>>>>> Quoting "Philip Peartree" <P.Peartree at postgrad.manchester.ac.uk>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm having a problem with a torque/maui setup (hence the mail to both
>>>>>>>> lists). Submitted jobs are being deferred, and this seems to be
>>>>>>>> primarily because they're all requesting the same resource (node24 at
>>>>>>>> this point). A qrun seems to shift them onto a correct node.
>>>>>>>>
>>>>>>>> My pbs_server log suggests that the job is being rejected by the mom,
>>>>>>>> and the logs on the mom show a rejection with code 15004 and the job
>>>>>>>> in unexpected state TRANSICM.
>>>>>>>> Can anyone help?
>>>>>>>>
>>>>>>>> Phil Peartree
>>>>>>>> University of Manchester
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> mauiusers mailing list
>>>>>>>> mauiusers at supercluster.org
>>>>>>>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>>
>>>
>>
>>
>
>
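
For anyone comparing several deferred jobs, the checkjob output quoted above can be reduced to the few fields that matter (state, defer reason, rc and message). The following is only a minimal sketch, assuming the output looks like the sample in this thread; the function name parse_defer_info and the SAMPLE text are illustrative and are not part of maui or torque, and other maui versions may format these lines differently.

```python
import re

# Excerpt of the checkjob output quoted in this thread, still wrapped
# the way the mail client wrapped it (quote markers removed).
SAMPLE = """\
State: Idle  EState: Deferred
job is deferred.  Reason:  RMFailure  (cannot start job - RM
failure, rc: 15041, msg: 'Execution server rejected request
MSG=cannot send job to mom, state=PRERUN')
Holds:    Defer  (hold reason:  RMFailure)
"""


def parse_defer_info(text):
    """Pull state, defer reason, rc and message out of checkjob-style text.

    Hypothetical helper: it only knows the line shapes seen in SAMPLE.
    """
    # Drop any leading '>' quote markers, then collapse all whitespace so
    # that wrapped lines are matched as if they were a single line.
    lines = [re.sub(r"^[>\s]+", "", line) for line in text.splitlines()]
    flat = " ".join(" ".join(lines).split())

    info = {}
    m = re.search(r"State: (\S+) EState: (\S+)", flat)
    if m:
        info["state"], info["estate"] = m.group(1), m.group(2)
    m = re.search(r"job is deferred\. Reason: (\S+)", flat)
    if m:
        info["reason"] = m.group(1)
    m = re.search(r"rc: (\d+)", flat)
    if m:
        info["rc"] = int(m.group(1))
    m = re.search(r"msg: '([^']*)'", flat)
    if m:
        info["msg"] = m.group(1)
    return info


if __name__ == "__main__":
    for key, value in parse_defer_info(SAMPLE).items():
        print(f"{key}: {value}")
```

The text could equally come from saving the output of checkjob for a job to a file and reading it back; the sketch deliberately does not call checkjob itself, so it makes no assumption about how the command is installed or invoked on a given cluster.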


