[torqueusers] output staying on nodes - pbs_mom problem ?

Clifton Kirby ckirby3 at colsa.com
Tue Sep 20 07:59:37 MDT 2005


Thanks for the assistance, Ronny.  I have been using the --with-scp
configure flag, not --use-scp; I don't believe --use-scp exists.  I have
also been using the $usecp directive in my mom config with much success.
I have only seen this problem since 1.2.0p5.

As I said, the problem is intermittent.  I tried increasing the $timeout
value in the mom config to 120 seconds, thinking I might have a timeout
problem, but that didn't help.

I may try the following technique, outlined in the Torque v2.0 Admin Manual,
since my cluster is very large:

"On large, slow, and/or heavily loaded systems, it may be desirable to
increase the pbs_tcp_timeout setting used by the pbs_mom daemon in
mom-to-mom communication. (NOTE: A system may be heavily loaded if it
reports multiple 'End of File from addr' or 'Premature end of message'
failures in the pbs_mom or pbs_server logs)  This setting defaults to 20
seconds and requires rebuilding code to adjust.  For client-server based
communication, this attribute can be set using the qmgr command.  For
mom-to-mom communication, a source code modification is required.  To make
this change, edit the $TORQUEBUILDDIR/src/lib/Libifl/tcp_dis.c file and set
pbs_tcp_timeout to the desired maximum number of seconds allowed for a
mom-to-mom request to be serviced."

However, most of those messages were eliminated with the 1.2.0p5 upgrade.
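
For anyone who wants to check whether a node still logs the two messages
named in the manual excerpt, a rough Python sketch along these lines should
do.  The function name is my own, and the log path in the usage comment is
only an assumption about a typical install; adjust both to your site.

```python
import sys
from collections import Counter

# Substrings taken from the Admin Manual excerpt quoted above.
PATTERNS = ("End of File from addr", "Premature end of message")


def count_overload_messages(lines):
    """Count how many lines contain each of the two message substrings."""
    counts = Counter({pattern: 0 for pattern in PATTERNS})
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage (the path is an assumption, not from the manual):
    #   python count_msgs.py /var/spool/torque/mom_logs/20050920
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for pattern, n in count_overload_messages(fh).items():
                print(f"{path}: {n} line(s) containing {pattern!r}")
```

It only does plain substring matching, so it says nothing about why the
messages appear; it is just a quick way to compare counts before and after
an upgrade or a timeout change.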

- Cliff

-----Original Message-----
From: Ronny T. Lampert [mailto:telecaadmin at uni.de]
Sent: Tuesday, September 20, 2005 2:46 AM
To: Clifton Kirby
Cc: HPC.admin at uea.ac.uk; torqueusers at supercluster.org
Subject: Re: [torqueusers] output staying on nodes - pbs_mom problem ?


Hi,

> Running 1.2.0p5, I am having a similar problem with the job staying in the
> "E" state and eventually clearing out reporting the same Post job file
> processing error.  However it is intermittent and I am running an Epilogue
> script as well but all processes have completed for the job and the Epilogue
> script.  Sometimes I get the Standard Out file but the Standard Error is
> still in the spool directory on the mother superior.  We never saw this
> behavior in 1.2.0p4.

State "E" means the job has finished and the postprocessing is being done;
this includes delivering the output back to the submitter.

The behaviour you mention indicates that the moms can't deliver the output
back; the moms use rcp (their own version, somewhere in the torque-dir) to do
this or can be configured to use scp instead (configure option --use-scp).

If you have networked filesystems, you should use the $usecp directive in
mom_priv/config:

$usecp *.<your>.<domain>:/home /home

Hope that helps?

Cheers,
Ronny



