[torqueusers] Re: [Mauiusers] Deferred jobs
Josh Butikofer
josh at clusterresources.com
Thu Dec 11 14:43:22 MST 2008
Philip,
See comments below:
Philip Peartree wrote:
> Here you go:
>
> Host: node14/node14 Version: 2.3.3 PID: 3411
> Server[0]: steel (10.0.0.254:15001)
> Init Msgs Received: 0 hellos/12666 cluster-addrs
> Init Msgs Sent: 12666 hellos
> Last Msg From Server: 0 seconds (CLUSTER_ADDRS)
> Last Msg To Server: 0 seconds
> HomeDirectory: /var/spool/torque/mom_priv
> ALERT: stdout/stderr spool directory '/var/spool/torque/spool/' is full
> NOTE: syslog enabled
> HomeDirectory: /var/spool/torque/mom_priv
> MOM active: 1139850 seconds
> Check Poll Time: 45 seconds
> Server Update Interval: 45 seconds
> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model: RPP
> MemLocked: TRUE (mlock)
> TCP Timeout: 20 seconds
> Prolog: /var/spool/torque/mom_priv/prologue (disabled)
> Alarm Time: 0 of 10 seconds
> Trusted Client List:
> 10.0.0.13,10.0.0.12,10.0.0.11,10.0.0.10,10.0.0.9,10.0.0.8,10.0.0.7,10.0.0.6,10.0.0.5,10.0.0.4,10.0.0.3,10.0.0.2,10.0.0.1,10.0.0.18,10.0.0.17,10.0.0.16,10.0.0.15,10.0.0.254,10.0.0.14,127.0.0.1
>
> Copy Command: /usr/bin/scp -rpB
> NOTE: no local jobs detected
>
> diagnostics complete
>
> the job I'm qrun-ing is 158
>
> The problem I have noticed is that either maui or torque seems to be
> sending all jobs to the same set of nodes, and running qrun seems to
> re-allocate some to other nodes but runs out of unaffected nodes
Yeah, Maui usually selects nodes in the same order, but if a node
doesn't work out it will send the job elsewhere. If the "bad" nodes are
the only free nodes left, then Maui will keep trying to send jobs to
those nodes, which results in the problems you are seeing.
I see the following from the above output:
ALERT: stdout/stderr spool directory '/var/spool/torque/spool/' is full
Not sure if this could be causing the error you are seeing, but it is
worth a look. Also, have you looked in the syslog to see if TORQUE is
logging any more details in there?
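If it helps to check several nodes at once, here is a minimal Python sketch (not part of TORQUE) that pulls the ALERT lines out of saved diagnostic output like the one quoted above and reports how much free space a directory has. The function names find_alerts and free_fraction are made up for illustration, and I am assuming "full" refers to disk space; it could also be an inode or permissions issue, which this does not check.

```python
import shutil


def find_alerts(diag_text):
    """Return the message of every ALERT line in momctl-style diagnostic text."""
    alerts = []
    for line in diag_text.splitlines():
        # Drop mail quoting ("> ") and surrounding whitespace before matching.
        stripped = line.lstrip("> ").strip()
        if stripped.startswith("ALERT:"):
            alerts.append(stripped[len("ALERT:"):].strip())
    return alerts


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a total of zero; treat them as having no free space.
        return 0.0
    return usage.free / usage.total


if __name__ == "__main__":
    sample = (
        "> HomeDirectory: /var/spool/torque/mom_priv\n"
        "> ALERT: stdout/stderr spool directory '/var/spool/torque/spool/' is full\n"
        "> NOTE: syslog enabled\n"
    )
    for message in find_alerts(sample):
        print("ALERT:", message)
    # Replace "." with the spool directory from the ALERT line when run on the node itself.
    print("free fraction:", round(free_fraction("."), 3))
```

Run on the node, pointing free_fraction at the spool directory named in the ALERT line would show whether the filesystem is actually out of space.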