[torqueusers] Torque configuration issue

Garrick Staples garrick at usc.edu
Tue Jan 10 12:26:39 MST 2006


On Tue, Jan 10, 2006 at 02:12:51PM +0100, Jo De Troy alleged:
> Hello,
> 
> I'm running Torque/Maui on a small cluster (1 headnode + 5 dual CPU nodes)
> running RHEL 3 and I have a few problems.
> Apparently, when a user submits a bunch of jobs in a row, the ones submitted
> last go into the Queued state and soon afterwards they disappear.
> When looking at these jobs with tracejob, they have an exit_status = -2.
> Is this a setting that limits the total number of jobs submitted by one
> user? Or is something else wrong?

Something is failing in the job launch, probably in the prologue or data
stagein.

The user should have gotten an email with an error message about the job
launch failure.  There should also be errors in the MOM log.  And if you
built TORQUE with --enable-syslog, you should see messages in syslog.
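
As a rough sketch of where to look: the helper below greps today's MOM log
for a job ID. It assumes TORQUE_HOME is /var/spool/torque (adjust for your
install) and that it is run on the compute node where the job was started.
The function name mom_log_for_job is made up for illustration.

```shell
# Assumption: TORQUE_HOME defaults to /var/spool/torque; override if needed.
TORQUE_HOME="${TORQUE_HOME:-/var/spool/torque}"

# Hypothetical helper: print lines from today's MOM log that mention a job ID.
# MOM log files are named by date (YYYYMMDD) under $TORQUE_HOME/mom_logs.
mom_log_for_job() {
    jobid="$1"
    logfile="$TORQUE_HOME/mom_logs/$(date +%Y%m%d)"
    grep -- "$jobid" "$logfile"
}
```

Call it with the job ID that tracejob reported, e.g. mom_log_for_job 123.headnode
(the job ID here is only a placeholder).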

 
> Another problem I have is that the jobs that run fine complain via e-mail
> about being unable to copy the OU and the ER files from the spool directory
> on the cluster node back to the home directory of the user who submitted the
> job.
> The headnode is NFS exporting /home to all compute nodes; the headnode
> is dual-homed (2 NICs).
> /home is mounted via the internal NIC, while the error states it's trying
> to copy the ER and OU files via the external NIC.

Two things:  It sounds like the server's hostname resolves to the external
interface instead of the internal interface.  You can set the server
name used by TORQUE to the internal name, see Appendix K of the admin
guide:
http://www.clusterresources.com/products/torque/docs20/torqueadmin.shtml
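
For illustration only: one place the server name is recorded is the
server_name file under TORQUE_HOME, which holds a single hostname. The
path and the name "headnode-internal" below are assumptions, not taken
from your setup; check the admin guide for what applies to your version.

```
# Contents of /var/spool/torque/server_name (assumed TORQUE_HOME)
# "headnode-internal" is a hypothetical name that resolves to the internal NIC
headnode-internal
```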

And you can use $usecp in the MOM config to tell MOM which filesystems
can be reached "locally", so that it copies files with a plain cp instead
of rcp/scp.  See the 'pbs_mom' manpage and section 6.2 of the admin guide.
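
A minimal sketch of such an entry, assuming /home is mounted at the same
path on every compute node and that the MOM config lives in
$TORQUE_HOME/mom_priv/config:

```
# mom_priv/config on each compute node
# Format: $usecp <host pattern>:<path on that host> <local path>
# Any destination under /home on any host is copied locally from /home
$usecp *:/home /home
```

MOM has to re-read its config (restart pbs_mom or send it a SIGHUP) before
the change takes effect.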

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California