[torqueusers] Re: [Mauiusers] Deferred jobs

Josh Butikofer josh at clusterresources.com
Thu Dec 11 14:43:22 MST 2008


Philip,

See comments below:

Philip Peartree wrote:
> Here you go:
> 
> Host: node14/node14   Version: 2.3.3   PID: 3411
> Server[0]: steel (10.0.0.254:15001)
>   Init Msgs Received:     0 hellos/12666 cluster-addrs
>   Init Msgs Sent:         12666 hellos
>   Last Msg From Server:   0 seconds (CLUSTER_ADDRS)
>   Last Msg To Server:     0 seconds
> HomeDirectory:          /var/spool/torque/mom_priv
> ALERT:  stdout/stderr spool directory '/var/spool/torque/spool/' is full
> NOTE:  syslog enabled
> HomeDirectory:          /var/spool/torque/mom_priv
> MOM active:             1139850 seconds
> Check Poll Time:        45 seconds
> Server Update Interval: 45 seconds
> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model:    RPP
> MemLocked:              TRUE  (mlock)
> TCP Timeout:            20 seconds
> Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
> Alarm Time:             0 of 10 seconds
> Trusted Client List:    
> 10.0.0.13,10.0.0.12,10.0.0.11,10.0.0.10,10.0.0.9,10.0.0.8,10.0.0.7,10.0.0.6,10.0.0.5,10.0.0.4,10.0.0.3,10.0.0.2,10.0.0.1,10.0.0.18,10.0.0.17,10.0.0.16,10.0.0.15,10.0.0.254,10.0.0.14,127.0.0.1 
> 
> Copy Command:           /usr/bin/scp -rpB
> NOTE:  no local jobs detected
> 
> diagnostics complete
> 
> the job I'm qrun-ing is 158
> 
> The problem I have noticed is that either maui or torque seems to be 
> sending all jobs to the same set of nodes, and running qrun seems to 
> re-allocate some to other nodes but runs out of unaffected nodes

Yeah, Maui usually selects nodes in the same order, but if a node 
doesn't work out it will send the job elsewhere. If the "bad" nodes are 
the only free nodes left, Maui will keep trying to send jobs to those 
nodes, which results in the problems you are seeing.
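
To make the behaviour concrete, here is a toy model in Python. It is only a sketch of the reasoning above, not Maui's actual allocation code: the function name `pick_nodes`, the fixed-order walk, and the idea of a known set of "bad" nodes are all assumptions made for illustration.

```python
def pick_nodes(free_nodes, bad_nodes, needed):
    """Toy first-fit allocator (hypothetical, not Maui's real algorithm).

    Walks the free nodes in the same order every time. A node in
    bad_nodes fails when tried, so the walk moves on to the next one.
    Returns (nodes that accepted the job, nodes that failed).
    """
    chosen, failed = [], []
    for node in free_nodes:
        if len(chosen) == needed:
            break
        if node in bad_nodes:
            failed.append(node)   # start attempt fails here
        else:
            chosen.append(node)   # job lands on this node
    return chosen, failed


# Plenty of healthy nodes free: the job routes around the bad one.
print(pick_nodes(["node13", "node14", "node15"], {"node14"}, 2))

# Only the bad node is free: every attempt fails the same way.
print(pick_nodes(["node14"], {"node14"}, 1))
```

In the second call nothing is ever chosen, which matches the situation where the healthy nodes are busy and only the problem nodes are left.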

I see the following from the above output:

ALERT:  stdout/stderr spool directory '/var/spool/torque/spool/' is full

Not sure if this could be causing the error you are seeing, but it is 
worth a look. Also, have you looked in the syslog to see if TORQUE is 
logging any more details in there?
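
If it helps, a quick way to check whether the spool directory really is out of space (or out of inodes) is a small script like the one below, run on the affected node. This is a minimal sketch: the function name `spool_report` is made up for the example, and the default path is simply the one from the ALERT line above. It uses only the Python standard library and does not talk to TORQUE at all.

```python
import os
import shutil


def spool_report(path="/var/spool/torque/spool/"):
    """Summarise space and inode usage for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)   # total/used/free in bytes
    st = os.statvfs(path)             # inode counts for the same filesystem
    percent = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "path": path,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "percent_used": percent,
        "free_inodes": st.f_favail,          # inodes available to non-root users
        "entries": len(os.listdir(path)),    # files currently sitting in the spool
    }
```

Calling `spool_report()` on node14 and printing the result should show whether free bytes or free inodes are at (or near) zero, and how many leftover files are in the directory. A large `entries` count with no running jobs would suggest stale output files that never got delivered.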

