[Mauiusers] Problems with routing queues?

Stephen Childs Stephen.Childs at cs.tcd.ie
Tue May 2 03:20:13 MDT 2006


At two of our sites that use PBS routing queues, large numbers of jobs end
up in the 'W' state. As you can see below, such a job has been assigned to
a compute node (and, from what I remember, it briefly enters state 'R' on
the node), but it then fails and gets stuck. Has anyone seen this behaviour
or got any suggestions?

Stephen


[root at gridgate root]# rpm -q maui torque
maui-3.2.6p13-5_SL30X
torque-2.0.0p7-1.sl3.st



[root at gridgate root]# qstat -n 9270

gridgate.cp.dias.ie:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
9270.gridgate.cp.dia cosmo003 cosmo    STDIN         --      1  --    --  24:00 W   --
    gridwn04

[root at gridgate root]# checkjob 9270
ERROR:    'checkjob' failed
ERROR:  cannot locate job '9270'

[root at gridgate root]# grep 9270 /var/log/maui.log|tail
05/02 10:13:13 WARNING:  job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:24 WARNING:  job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:35 WARNING:  job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:46 WARNING:  job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:13:57 WARNING:  job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:08 WARNING:  job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:19 WARNING:  job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:30 WARNING:  job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:41 WARNING:  job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
05/02 10:14:52 WARNING:  job '9270.gridgate.cp.dias.ie' detected with unexpected state '11'
[root at gridgate root]#
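
To see how many jobs are affected without grepping for each job ID in turn, the warnings in maui.log can be tallied per job. What follows is only a minimal sketch: it assumes the warning format shown in the excerpt above, with each warning on a single line, and the function name count_unexpected_states is made up for illustration.

```python
import re
from collections import Counter

# Matches warnings of the form seen in the maui.log excerpt, e.g.
# 05/02 10:13:13 WARNING:  job '<job id>' detected with unexpected state '11'
WARNING_RE = re.compile(
    r"WARNING:\s+job '(?P<job>[^']+)' detected with\s+"
    r"unexpected state '(?P<state>\d+)'"
)


def count_unexpected_states(lines):
    """Return a Counter keyed by (job id, state) for matching warning lines.

    Hypothetical helper: lines that do not match the pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[(match.group("job"), match.group("state"))] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the excerpt; a real run would iterate over
    # the open log file instead.
    sample = [
        "05/02 10:13:13 WARNING:  job '9270.gridgate.cp.dias.ie' "
        "detected with unexpected state '11'",
        "05/02 10:13:24 WARNING:  job '9270.gridgate.cp.dias.ie' "
        "detected with unexpected state '11'",
    ]
    for (job, state), n in count_unexpected_states(sample).most_common():
        print(f"{job} state={state} warnings={n}")
```

Passing an open file object (for example the result of open("/var/log/maui.log")) in place of the sample list should work the same way, since the function only iterates over lines.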


05/02/2006 09:54:54;0040;PBS_Server;Svr;gridgate.cp.dias.ie;Scheduler sent command new
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type StatusNode request received from root at gridgate.cp.dias.ie, sock=9
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type StatusQueue request received from root at gridgate.cp.dias.ie, sock=9
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type StatusJob request received from root at gridgate.cp.dias.ie, sock=9
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type ModifyJob request received from root at gridgate.cp.dias.ie, sock=9
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Modified at request of root at gridgate.cp.dias.ie
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type RunJob request received from root at gridgate.cp.dias.ie, sock=9
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Run at request of root at gridgate.cp.dias.ie
05/02/2006 09:54:55;0100;PBS_Server;Req;;Type ModifyJob request received from root at gridgate.cp.dias.ie, sock=9
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;Job Modified at request of root at gridgate.cp.dias.ie
05/02/2006 09:54:55;0008;PBS_Server;Job;9270.gridgate.cp.dias.ie;MOM rejected modify request, error: 15001
05/02/2006 09:54:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=ModifyJob, from root at gridgate.cp.dias.ie
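
The server log records above are semicolon-separated, so the sequence of events for one job (modify, run, modify, MOM rejection) can be pulled out mechanically. The sketch below assumes six fields per record, as in the excerpt, and that each record sits on one line. The field names and the helpers parse_server_log_line and job_events are labels chosen for illustration, not anything defined by TORQUE.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class ServerLogRecord(NamedTuple):
    # Field names are illustrative labels for the six columns in the excerpt.
    timestamp: str
    event_code: str
    server: str
    object_type: str
    object_name: str
    message: str


def parse_server_log_line(line: str) -> Optional[ServerLogRecord]:
    """Split one record into six fields; return None if it does not fit.

    The split is limited to five separators so that any semicolon inside
    the free-text message stays part of the message.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return ServerLogRecord(*parts)


def job_events(lines: Iterable[str], job_id: str) -> Iterator[ServerLogRecord]:
    """Yield the 'Job' records whose object name equals job_id."""
    for line in lines:
        record = parse_server_log_line(line)
        if record and record.object_type == "Job" and record.object_name == job_id:
            yield record
```

For example, iterating over job_events(log_lines, "9270.gridgate.cp.dias.ie") and printing each record's timestamp and message would list the job's history in order, ending with the "MOM rejected modify request" entry.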



-- 
Dr. Stephen Childs,
Research Fellow, EGEE Project,    phone:                    +353-1-6081797
Computer Architecture Group,      email:        Stephen.Childs @ cs.tcd.ie
Trinity College Dublin, Ireland   web: http://www.cs.tcd.ie/Stephen.Childs

