[torqueusers] output staying on nodes - pbs_mom problem ?
ckirby3 at colsa.com
Tue Sep 20 07:59:37 MDT 2005
Thanks for the assistance, Ronny. I have been using the --with-scp flag, not
--use-scp; I don't believe --use-scp exists. I have also been using the
$usecp statement in my mom config with much success. I have only seen this
problem since moving to 1.2.0p5.
As I said, the problem is intermittent. I tried increasing the $timeout
value in the mom config to 120 seconds, thinking I might have a timeout
problem, but that didn't help.
I may use the following technique outlined in the Torque v2.0 Admin Manual,
since my cluster is very large:
"On large, slow, and/or heavily loaded systems, it may be desirable to
increase the pbs_tcp_timeout setting used by the pbs_mom daemon in
mom-to-mom communication. (NOTE: A system may be heavily loaded if it
reports multiple 'End of File from addr' or 'Premature end of message'
failures in the pbs_mom or pbs_server logs) This setting defaults to 20
seconds and requires rebuilding code to adjust. For client-server based
communication, this attribute can be set using the qmgr command. For
mom-to-mom communication, a source code modification is required. To make
this change, edit the $TORQUEBUILDDIR/src/lib/Libifl/tcp_dis.c file and set
pbs_tcp_timeout to the desired maximum number of seconds allowed for a
mom-to-mom request to be serviced."
However, most of those messages were eliminated with the 1.2.0p5 upgrade.
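To check how often the two messages named in the manual excerpt still show up, a small script along these lines could be run over the pbs_mom or pbs_server logs. This is only a sketch: the function name count_failures is made up for illustration, and it assumes the messages appear verbatim somewhere in a log line.

```python
from collections import Counter

# The two failure strings quoted in the admin manual excerpt above.
PATTERNS = ("End of File from addr", "Premature end of message")


def count_failures(lines):
    """Count how many lines contain each failure string.

    `lines` is any iterable of strings, for example an open log file.
    """
    counts = Counter({pattern: 0 for pattern in PATTERNS})
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return dict(counts)
```

Passing an open log file (the path depends on the installation) returns a dict of counts per message, which makes it easy to compare a day before the upgrade against a day after it.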
From: Ronny T. Lampert [mailto:telecaadmin at uni.de]
Sent: Tuesday, September 20, 2005 2:46 AM
To: Clifton Kirby
Cc: HPC.admin at uea.ac.uk; torqueusers at supercluster.org
Subject: Re: [torqueusers] output staying on nodes - pbs_mom problem ?
> Running 1.2.0p5, I am having a similar problem with the job staying in the
> "E" state and eventually clearing out reporting the same Post job file
> processing error. However it is intermittent and I am running an Epilogue
> script as well but all processes have completed for the job and the
> script. Sometimes I get the Standard Out file but the Standard Error is
> still in the spool directory on the mother superior. We never saw this
> behavior in 1.2.0p4.
State "E" means the job has finished and the postprocessing is being done;
this includes delivering the output back to the submitter.
The behaviour you mention indicates that the moms can't deliver the output
back; the moms use rcp (their own version, somewhere in the torque directory)
to do this, or can be configured to use scp instead (configure option --use-scp).
If you have networked filesystems, you should use the $usecp directive in
your mom config:
$usecp *.<your>.<domain>:/home /home
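As a rough illustration of what such a $usecp line expresses (a host pattern plus a remote path prefix mapped onto a local path prefix), here is a small sketch. It is not pbs_mom's actual code: the function name apply_usecp and the rule tuples are hypothetical, and the shell-style wildcard matching via fnmatch is an assumption about how the host pattern is interpreted.

```python
from fnmatch import fnmatch


def apply_usecp(rules, destination):
    """Map a 'host:/path' destination to a local path, or return None.

    `rules` is a list of (host_pattern, remote_prefix, local_prefix)
    tuples, one per $usecp line. Returning None stands for "no rule
    matched, fall back to a remote copy (rcp/scp)".
    """
    host, _, path = destination.partition(":")
    for host_pattern, remote_prefix, local_prefix in rules:
        # Assumption: the host part is matched as a shell-style wildcard.
        if fnmatch(host, host_pattern) and path.startswith(remote_prefix):
            return local_prefix + path[len(remote_prefix):]
    return None
```

With a rule such as ("*.example.org", "/home", "/home"), a destination on a matching host under /home resolves to the same path on the shared filesystem, so a plain local copy is enough.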
Hope that helps?