[torqueusers] mom server communication failings.

Steve Traylen s.traylen at rl.ac.uk
Tue Aug 3 09:30:07 MDT 2004


On Mon, Aug 02, 2004 at 08:53:28PM +0100 or thereabouts, Steve Traylen wrote:
> 
> Hi,
> 
>  I'm using these versions:
> 
>  maui-3.2.6p6
>  torque-1.0.1p6
> 
>  After recently increasing the size of our maui/torque farm I've
>  been having more or less constant problems.

Looking at this some more, it looks like a job can get into a state that
it can never get out of.

The exec_host appears to have been set during the first execution attempt,
when the pbs_mom rejected it. So now the job is queued with an exec_host set:


# qstat -f  27346 | grep exec_host
    exec_host = lcg0453.gridpp.rl.ac.uk/1

# qstat -f | grep job_state
    job_state = Q
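
To find other jobs in the same condition, something like the following could
scan the full `qstat -f` output for jobs that are queued but still carry an
exec_host. This is only a rough sketch: it assumes the usual layout of a
"Job Id:" header followed by indented "name = value" lines, with long values
wrapped onto tab-indented lines. The function names and the SAMPLE text are
made up for illustration (the R state of the second job is assumed).

```python
def parse_qstat_full(text):
    """Split `qstat -f` style output into {job_id: {attribute: value}}."""
    jobs = {}
    current = None   # attribute dict of the job being read
    last_key = None  # last attribute seen, for wrapped values
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            job_id = line.split(":", 1)[1].strip()
            current = jobs.setdefault(job_id, {})
            last_key = None
        elif current is None:
            continue
        elif line.startswith("\t") and last_key is not None:
            # Assumed: wrapped values continue on tab-indented lines.
            current[last_key] += line.strip()
        elif " = " in line:
            key, value = line.strip().split(" = ", 1)
            current[key] = value
            last_key = key
    return jobs


def queued_with_exec_host(jobs):
    """Return ids of jobs in state Q that still have an exec_host set."""
    return sorted(
        job_id
        for job_id, attrs in jobs.items()
        if attrs.get("job_state") == "Q" and "exec_host" in attrs
    )


# Illustrative input only, modelled on the output above.
SAMPLE = """Job Id: 27346.lcgce02.gridpp.rl.ac.uk
    job_state = Q
    exec_host = lcg0453.gridpp.rl.ac.uk/1

Job Id: 27357.lcgce02.gridpp.rl.ac.uk
    job_state = R
    exec_host = lcg0453.gridpp.rl.ac.uk/1
"""

if __name__ == "__main__":
    for job_id in queued_with_exec_host(parse_qstat_full(SAMPLE)):
        print(job_id)
```

In practice the text would come from running `qstat -f` rather than from
SAMPLE.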

The node in question though is now busy with other newer jobs.

# qmgr -c 'l n lcg0453.gridpp.rl.ac.uk'
Node lcg0453.gridpp.rl.ac.uk
        state = job-exclusive,busy
        np = 2
        properties = lcgpro
        ntype = cluster
        jobs = 0/27079.lcgce02.gridpp.rl.ac.uk,
               1/27357.lcgce02.gridpp.rl.ac.uk

because the pbs_mom failure was just a transient error.
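
The same cross-check can be done mechanically: take the node listing and
confirm that the queued job's number is not among the jobs the node is
actually running. A minimal sketch, assuming the "jobs = slot/jobid, ..."
layout shown above; node_jobs and node_holds_job are hypothetical helper
names, not part of torque.

```python
def node_jobs(listing):
    """Return the job ids found in the `jobs = slot/jobid, ...` attribute."""
    ids = []
    in_jobs = False
    for line in listing.splitlines():
        stripped = line.strip()
        if stripped.startswith("jobs = "):
            in_jobs = True
            stripped = stripped[len("jobs = "):]
        elif in_jobs and (not stripped or " = " in stripped):
            # A blank line or another attribute ends the jobs list.
            in_jobs = False
        if in_jobs:
            for entry in stripped.split(","):
                entry = entry.strip()
                if "/" in entry:
                    # Drop the "slot/" prefix and keep the job id.
                    ids.append(entry.split("/", 1)[1])
    return ids


def node_holds_job(listing, job_number):
    """True if a job with this leading number is listed on the node."""
    return any(j.split(".", 1)[0] == job_number for j in node_jobs(listing))


# The node listing from above, pasted in as test input.
NODE_LISTING = """Node lcg0453.gridpp.rl.ac.uk
        state = job-exclusive,busy
        np = 2
        properties = lcgpro
        ntype = cluster
        jobs = 0/27079.lcgce02.gridpp.rl.ac.uk,
               1/27357.lcgce02.gridpp.rl.ac.uk
"""

if __name__ == "__main__":
    print(node_jobs(NODE_LISTING))
    print(node_holds_job(NODE_LISTING, "27346"))
```

A job that reports an exec_host on a node but does not appear in that node's
jobs list is a candidate for the stuck state described here.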

If anyone has a way to unset the exec_host, that would help. It all results
in a bad interaction with maui: while this stuck job is there, soft limits
rather than hard limits are imposed for the user, and all of the jobs from this
user pile up behind the stuck one.

 Steve


> 
>  There appears to be a breakdown in the communication between pbs_server
>  and the pbs_moms.
> 
>  For instance 
> 
>  + In a maui log, all queued jobs get blocked because of 
> 
>    08/02 20:38:27 ERROR:    job '26673' cannot be started:
>         (rc: 15057  errmsg: 'Cannot execute at specified host because 
>         of checkpoint or stagein files'  hostlist: 'lcg0279.gridpp.rl.ac.uk')
>    08/02 20:38:27 MPBSJobModify(26673,Resource_List,Resource,1)
>    08/02 20:38:27 ERROR:    MBFFirstFit:  cannot start job 26673.0
>    08/02 20:38:27 MRMJobStart(26711,Msg,SC)
>    08/02 20:38:27 MPBSJobStart(26711,base,Msg,SC)
>    08/02 20:38:27 MPBSJobModify(26711,Resource_List,Resource,
>                lcg0279.gridpp.rl.ac.uk)
>    08/02 20:38:30 ERROR:    job '26711' cannot be started: 
>         (rc: 15070  errmsg: 'Server could not connect to MOM'
>          hostlist: 'lcg0279.gridpp.rl.ac.uk')
> 
>  On the corresponding mom 
> 
>  pbs_mom;Svr;pbs_mom;im_eof, Premature end of message
>         from addr 130.246.183.188:15001
> 
>  The node is generally running fine.
> 
> 
>  + Another common observation is that two-job nodes stay in 'busy'
>    state when one of their jobs has finished, despite this now creating
>    a new slot.
> 
>  + The pbs_server regularly goes into a loop and fills the CPU.
> 
>  Any suggestions? 
>    
> 
> -- 
> Steve Traylen
> s.traylen at rl.ac.uk
> http://www.gridpp.ac.uk/
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://supercluster.org/mailman/listinfo/torqueusers

-- 
Steve Traylen
s.traylen at rl.ac.uk
http://www.gridpp.ac.uk/

