[torqueusers] Re: [Mauiusers] Deferred jobs

Philip Peartree P.Peartree at postgrad.manchester.ac.uk
Thu Dec 11 13:50:45 MST 2008


Here you go:

Host: node14/node14   Version: 2.3.3   PID: 3411
Server[0]: steel (10.0.0.254:15001)
   Init Msgs Received:     0 hellos/12666 cluster-addrs
   Init Msgs Sent:         12666 hellos
   Last Msg From Server:   0 seconds (CLUSTER_ADDRS)
   Last Msg To Server:     0 seconds
HomeDirectory:          /var/spool/torque/mom_priv
ALERT:  stdout/stderr spool directory '/var/spool/torque/spool/' is full
NOTE:  syslog enabled
HomeDirectory:          /var/spool/torque/mom_priv
MOM active:             1139850 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
MemLocked:              TRUE  (mlock)
TCP Timeout:            20 seconds
Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:     
10.0.0.13,10.0.0.12,10.0.0.11,10.0.0.10,10.0.0.9,10.0.0.8,10.0.0.7,10.0.0.6,10.0.0.5,10.0.0.4,10.0.0.3,10.0.0.2,10.0.0.1,10.0.0.18,10.0.0.17,10.0.0.16,10.0.0.15,10.0.0.254,10.0.0.14,127.0.0.1
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete
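
One thing that jumps out of the output above is the ALERT that the
stdout/stderr spool directory is full. I'm not sure whether that is related
to the PRERUN rejections, but it seems worth ruling out first. A quick check,
using the default paths shown in the diagnostics, would be something like:

    df -h /var/spool/torque/spool           # is the filesystem actually out of space?
    du -sh /var/spool/torque/spool          # total size of spooled stdout/stderr files
    ls -lt /var/spool/torque/spool | head   # any stale spool files (.OU/.ER) left behind?

If that partition really is full, the mom may not be able to spool job output,
so freeing some space there could be worth trying before anything else.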

The job I'm qrun-ing is 158.

The problem I have noticed is that either Maui or Torque seems to be
sending all jobs to the same set of nodes; running qrun re-allocates
some of them to other nodes, but it eventually runs out of unaffected nodes.
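
To check whether that is actually what is happening, I think the standard
Torque/Maui query commands should show the allocation from both sides (I'm
not certain they will pinpoint the cause, but they should at least show where
each side thinks the jobs are):

    pbsnodes -a     # node states as Torque sees them (look at the "state =" lines)
    qstat -rn       # running jobs and the execution hosts they were given
    diagnose -n     # Maui's view of node allocation

If Maui's view (diagnose -n) and Torque's (pbsnodes -a) disagree about which
nodes are busy, that might explain why new jobs keep being steered at the
same set of nodes.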





Quoting "Josh Butikofer" <josh at clusterresources.com>:

> Philip,
>
> What node is the job trying to run on when it gets this error message?
> Also, is the job you are trying to qrun named 158? If not, then I
> suspect that job 158 is clogging up successful runs on that pbs_mom.
>
> Run a "momctl -d 3 -h <NODENAME>" on that node and send us the output.
> This will tell us what that pbs_mom believes the status quo is.
>
> Josh Butikofer
> Cluster Resources, Inc.
> #############################
>
>
> Philip Peartree wrote:
>> It was working yesterday, but when I came to run some more jobs,   
>> they wouldn't go.
>>
>> I just tried qrunning again and I get this error:
>>
>> qrun: Execution server rejected request MSG=cannot send job to mom,
>> state=PRERUN 158.steel.mib.man.ac.uk
>>
>> I've reported this error to the support contact at the manufacturer
>> (who did the initial install), so we'll wait and see what comes from that!
>>
>>
>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>
>>> I think the cc got mixed up; not sure if it made it to the list.
>>>
>>> It could be possible that ssh is a problem. However, if it were, I'd think
>>> it would be consistently failing. It sounds like sometimes it does work?
>>>
>>> -Steve
>>>
>>>
>>> On Dec 11, 2008, at 12:05 PM, Philip Peartree wrote:
>>>
>>>> Is it possible it could be an ssh problem, since:
>>>>
>>>> a) that is how internode communication is handled
>>>> b) there seems to be nothing showing up in the pbs_mom logs on the nodes
>>>> c) the problem I had fixed was to do with that
>>>>
>>>> Any ideas guys?
>>>>
>>>>
>>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>>
>>>>> Hmm... I'm not sure... I was hoping someone else would chime in with some
>>>>> ideas too =). Let's see if anyone else pipes up.
>>>>>
>>>>> -Steve
>>>>>
>>>>>
>>>>>
>>>>> On Dec 11, 2008, at 11:51 AM, Philip Peartree wrote:
>>>>>
>>>>>> Checknode and pbsnodes -a show the node as OK, and releasehold tries
>>>>>> to run the job, but it returns to deferred status.
>>>>>>
>>>>>>
>>>>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>>>>
>>>>>>> Hi Philip,
>>>>>>>    How about checknode on the node it was trying to run on? Does it see
>>>>>>> the node OK? Or possibly pbsnodes -a <nodename> to make sure that
>>>>>>> torque is seeing the node properly? I'm just grasping at straws here
>>>>>>> =).... If you run releasehold <jobid>, does the job run after that?
>>>>>>>
>>>>>>> -Steve
>>>>>>>
>>>>>>> On Dec 11, 2008, at 10:48 AM, Philip Peartree wrote:
>>>>>>>
>>>>>>>> I now have this problem on a different cluster (but again running
>>>>>>>> torque and maui).
>>>>>>>>
>>>>>>>> Checkjob for the job gives:
>>>>>>>>
>>>>>>>> State: Idle  EState: Deferred
>>>>>>>> Creds:  user:mcdiypp2  group:nmrc  class:med_12h  qos:DEFAULT
>>>>>>>> WallTime: 00:00:00 of 6:00:00
>>>>>>>> SubmitTime: Thu Dec 11 15:24:45
>>>>>>>> (Time Queued  Total: 00:19:55  Eligible: 00:00:01)
>>>>>>>>
>>>>>>>> StartDate: -00:19:53  Thu Dec 11 15:24:47
>>>>>>>> Total Tasks: 32
>>>>>>>>
>>>>>>>> Req[0]  TaskCount: 32  Partition: ALL
>>>>>>>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>>>>>>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>>>>>>>
>>>>>>>>
>>>>>>>> IWD: [NONE]  Executable:  [NONE]
>>>>>>>> Bypass: 0  StartCount: 1
>>>>>>>> PartitionMask: [ALL]
>>>>>>>> Flags:       RESTARTABLE
>>>>>>>>
>>>>>>>> job is deferred.  Reason:  RMFailure  (cannot start job - RM failure,
>>>>>>>> rc: 15041, msg: 'Execution server rejected request MSG=cannot send
>>>>>>>> job to mom, state=PRERUN')
>>>>>>>> Holds:    Defer  (hold reason:  RMFailure)
>>>>>>>> PE:  32.00  StartPriority:  1
>>>>>>>> cannot select job 157 for partition DEFAULT (job hold active)
>>>>>>>>
>>>>>>>> Having looked this up on Google, it says it might be a torque problem,
>>>>>>>> but the basic problem (as I see it) seems to be that two jobs are
>>>>>>>> assigned to the same set of processors/nodes, and I thought that this
>>>>>>>> is the job of maui. This has happened previously, and resolved itself
>>>>>>>> (admittedly while another problem was being sorted out).
>>>>>>>>
>>>>>>>> I have checked the logs on the affected nodes and there is nothing to
>>>>>>>> say whether it even got the job at all!
>>>>>>>>
>>>>>>>>
>>>>>>>> Quoting "Steve Young" <chemadm at hamilton.edu>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>    I was looking at the maui manual at:
>>>>>>>>>
>>>>>>>>> http://www.clusterresources.com/products/maui/docs/11.1jobholds.shtml
>>>>>>>>>
>>>>>>>>> What does checkjob tell you for that job?
>>>>>>>>>
>>>>>>>>> -Steve
>>>>>>>>>
>>>>>>>>> On Dec 11, 2008, at 9:40 AM, Philip Peartree wrote:
>>>>>>>>>
>>>>>>>>>> Does anyone have any ideas?
>>>>>>>>>>
>>>>>>>>>> Quoting "Philip Peartree" <P.Peartree at postgrad.manchester.ac.uk>:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I'm having a problem with a torque/maui setup (hence the   
>>>>>>>>>>> mail to both
>>>>>>>>>>> lists). Submitted jobs are being deferred, and this    
>>>>>>>>>>> primarily seems to
>>>>>>>>>>> be because they're all requesting the same resource (node24 at this
>>>>>>>>>>> point). A qrun seems to shift them onto a correct node.
>>>>>>>>>>>
>>>>>>>>>>> My pbs_server log suggests that it's being rejected by the mom, and a
>>>>>>>>>>> look at the logs on the mom shows a rejection going on with code 15004
>>>>>>>>>>> and the job in unexpected state TRANSICM.
>>>>>>>>>>>
>>>>>>>>>>> Can anyone help?
>>>>>>>>>>>
>>>>>>>>>>> Phil Peartree
>>>>>>>>>>> University of Manchester
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>



