[torqueusers] Re: [Mauiusers] Deferred jobs
Josh Butikofer
josh at clusterresources.com
Thu Dec 11 11:38:03 MST 2008
Philip,
What node is the job trying to run on when it gets this error message?
Also, is the job you are trying to qrun job 158? If not, then I
suspect that job 158 is blocking successful runs on that pbs_mom.
Run "momctl -d 3 -h <NODENAME>" against that node and send us the output.
This will tell us what that pbs_mom believes its current state is.
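
To make the job id comparison concrete, here is a minimal sketch in Python. It is not part of TORQUE or Maui: the helper names job_in_error and is_other_job are made up for illustration, and it assumes the error text contains a job id of the form <number>.<server>, as in the qrun message quoted below. The requested job number 157 is only an example value.

```python
import re

# Assumed job id shape: sequence number, a dot, then the server host name,
# e.g. "158.steel.mib.man.ac.uk".
JOB_ID = re.compile(r"\b(\d+)\.([A-Za-z0-9][A-Za-z0-9.-]*)")


def job_in_error(message):
    """Return (sequence_number, server) for the first job id found, or None."""
    match = JOB_ID.search(message)
    if match is None:
        return None
    # Drop a trailing dot in case the id ends a sentence.
    return int(match.group(1)), match.group(2).rstrip(".")


def is_other_job(message, requested_seq):
    """True if the error names a job other than the one passed to qrun."""
    found = job_in_error(message)
    return found is not None and found[0] != requested_seq


error = ("qrun: Execution server rejected request MSG=cannot send job to mom, "
         "state=PRERUN 158.steel.mib.man.ac.uk")

print(job_in_error(error))       # (158, 'steel.mib.man.ac.uk')
print(is_other_job(error, 157))  # True
```

If is_other_job returns True, the id in the error belongs to a different job than the one you asked qrun to start, which is the case I suspect here.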
Josh Butikofer
Cluster Resources, Inc.
#############################
Philip Peartree wrote:
> It was working yesterday, but when I came to run some more jobs, they
> wouldn't go.
>
> I just tried qrunning again and I get this error:
>
> qrun: Execution server rejected request MSG=cannot send job to mom,
> state=PRERUN 158.steel.mib.man.ac.uk
>
> I've reported this error to the support contact at the manufacturer (who
> did the initial install) so we'll wait and see what comes from that!
>
>
> Quoting "Steve Young" <chemadm at hamilton.edu>:
>
>> I think the cc got mixed up; not sure if it made it to the list.
>>
>> It could be possible ssh is a problem. However, if it were I'd think it
>> would be consistently not working. It sounds like sometimes it does
>> work?
>>
>> -Steve
>>
>>
>> On Dec 11, 2008, at 12:05 PM, Philip Peartree wrote:
>>
>>> Is it possible it could be an ssh problem, since:
>>>
>>> a) that is how internode communication is handled
>>> b) there seems to be nothing showing up on the pbs_mom logs on the nodes
>>> c) the problem I had fixed was to do with that
>>>
>>> Any ideas guys?
>>>
>>>
>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>
>>>> Hmm... I'm not sure... was hoping someone else would chime in with some
>>>> ideas too =). Let's see if anyone else pipes up.
>>>>
>>>> -Steve
>>>>
>>>>
>>>>
>>>> On Dec 11, 2008, at 11:51 AM, Philip Peartree wrote:
>>>>
>>>>> Checknode and pbsnodes -a show the node ok, and releasehold tries
>>>>> to run the job, but it returns to deferred status.
>>>>>
>>>>>
>>>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>>>
>>>>>> Hi Philip,
>>>>>> How about checknode on the node it was trying to run on? Does it see
>>>>>> the node ok? Or possibly pbsnodes -a <nodename> to make sure that
>>>>>> torque is seeing the node properly? I'm just grasping at straws here
>>>>>> =).... If you run releasehold <jobid>, does the job run after that?
>>>>>>
>>>>>> -Steve
>>>>>>
>>>>>> On Dec 11, 2008, at 10:48 AM, Philip Peartree wrote:
>>>>>>
>>>>>>> I now have this problem on a different cluster (but again
>>>>>>> running torque and maui)
>>>>>>>
>>>>>>> Checkjob for the job gives:
>>>>>>>
>>>>>>> State: Idle EState: Deferred
>>>>>>> Creds: user:mcdiypp2 group:nmrc class:med_12h qos:DEFAULT
>>>>>>> WallTime: 00:00:00 of 6:00:00
>>>>>>> SubmitTime: Thu Dec 11 15:24:45
>>>>>>> (Time Queued Total: 00:19:55 Eligible: 00:00:01)
>>>>>>>
>>>>>>> StartDate: -00:19:53 Thu Dec 11 15:24:47
>>>>>>> Total Tasks: 32
>>>>>>>
>>>>>>> Req[0] TaskCount: 32 Partition: ALL
>>>>>>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>>>>>>> Opsys: [NONE] Arch: [NONE] Features: [NONE]
>>>>>>>
>>>>>>>
>>>>>>> IWD: [NONE] Executable: [NONE]
>>>>>>> Bypass: 0 StartCount: 1
>>>>>>> PartitionMask: [ALL]
>>>>>>> Flags: RESTARTABLE
>>>>>>>
>>>>>>> job is deferred. Reason: RMFailure (cannot start job - RM
>>>>>>> failure, rc: 15041, msg: 'Execution server rejected request
>>>>>>> MSG=cannot send job to mom, state=PRERUN')
>>>>>>> Holds: Defer (hold reason: RMFailure)
>>>>>>> PE: 32.00 StartPriority: 1
>>>>>>> cannot select job 157 for partition DEFAULT (job hold active)
>>>>>>>
>>>>>>> I looked this up on Google, which suggests it might be a torque
>>>>>>> problem, but the basic problem (as I see it) seems to be that
>>>>>>> two jobs are assigned to the same set of processors/nodes, and
>>>>>>> I thought that node assignment is the job of maui. This has
>>>>>>> happened previously, and resolved itself (admittedly while
>>>>>>> another problem was being sorted).
>>>>>>>
>>>>>>> I have checked the logs on the nodes affected and there is
>>>>>>> nothing to say if it even got the job at all!!!
>>>>>>>
>>>>>>>
>>>>>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I was looking at the maui manual at:
>>>>>>>>
>>>>>>>> http://www.clusterresources.com/products/maui/docs/11.1jobholds.shtml
>>>>>>>>
>>>>>>>>
>>>>>>>> What does checkjob tell you for that job?
>>>>>>>>
>>>>>>>> -Steve
>>>>>>>>
>>>>>>>> On Dec 11, 2008, at 9:40 AM, Philip Peartree wrote:
>>>>>>>>
>>>>>>>>> Does anyone have any ideas?
>>>>>>>>>
>>>>>>>>> Quoting "Philip Peartree" <P.Peartree at postgrad.manchester.ac.uk>:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm having a problem with a torque/maui setup (hence the mail to both
>>>>>>>>>> lists). Submitted jobs are being deferred, and this primarily seems to
>>>>>>>>>> be because they're all requesting the same resource (node24 at this
>>>>>>>>>> point). A qrun seems to shift them onto a correct node.
>>>>>>>>>>
>>>>>>>>>> My pbs_server log suggests that it's being rejected by the mom, and a
>>>>>>>>>> look at the logs on the mom shows a rejection going on with code 15004
>>>>>>>>>> and the job in unexpected state TRANSICM.
>>>>>>>>>> Can anyone help?
>>>>>>>>>>
>>>>>>>>>> Phil Peartree
>>>>>>>>>> University of Manchester
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> mauiusers mailing list
>>>>>>>>>> mauiusers at supercluster.org
>>>>>>>>>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers