[torqueusers] Post Job File Processing Error (using: TORQUE
v2.0.0p8 from ATrpms under Fedora Core 3)
Alexander Papaspyrou
alexander.papaspyrou at udo.edu
Thu Apr 13 06:56:32 MDT 2006
Hi all,
After a fresh installation and configuration of TORQUE, I keep
getting the following error message for a simple test job:
--8<-- snip --8<--
PBS Job Id: [id].torque-srv.cluster.local
Job Name: hostname
An error has occurred processing your job, see below.
Post job file processing error; job [id].torque-srv.cluster.local on
host torque-mom01
Unable to copy file /var/spool/pbs/spool/[id].torque-s.OU to
/home/jdoe/hostname.out
-->8-- snap -->8--
The test job looks like this:
--8<-- snip --8<--
#PBS -o torque-srv:/home/jdoe/hostname.out
#PBS -e torque-srv:/home/jdoe/hostname.err
/bin/hostname
-->8-- snap -->8--
The MOM configuration file on all "torque-mom[num]" hosts looks like this:
--8<-- snip --8<--
$clienthost torque-srv.cluster.local
$usecp *:/home /home
-->8-- snap -->8--
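As far as I understand it, the $usecp line should make the MOM treat any destination of the form host:/home/... as the local path /home/... and copy with plain cp instead of a remote copy. Below is a rough sketch of the mapping I expect. The helper name map_usecp is made up for illustration, and treating the host part as a shell glob is my assumption, not something taken from the TORQUE sources.

```shell
#!/bin/sh
# Simplified model of a $usecp rule: "<host-pattern>:<remote-prefix> <local-prefix>".
# map_usecp is a hypothetical helper, not a TORQUE command.
map_usecp() {
    dest=$1          # e.g. torque-srv:/home/jdoe/hostname.out
    rule_src=$2      # e.g. *:/home
    rule_dst=$3      # e.g. /home

    dest_host=${dest%%:*}        # part before the first colon
    dest_path=${dest#*:}         # part after the first colon
    rule_host=${rule_src%%:*}
    rule_prefix=${rule_src#*:}

    # Assumption: the host pattern is matched like a shell glob
    # (the variable is left unquoted on purpose).
    case $dest_host in
        $rule_host) ;;
        *) return 1 ;;
    esac

    # The destination path must start with the rule's prefix
    # (plain string prefix, no path-component check in this sketch).
    case $dest_path in
        "$rule_prefix"*) ;;
        *) return 1 ;;
    esac

    # Swap the remote prefix for the local one.
    printf '%s%s\n' "$rule_dst" "${dest_path#"$rule_prefix"}"
}

# prints /home/jdoe/hostname.out
map_usecp 'torque-srv:/home/jdoe/hostname.out' '*:/home' /home
```

With the rule above, the stdout destination from my test job would map to /home/jdoe/hostname.out on the MOM itself, which is exactly the file named in the error message.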
A sample tracejob output looks like this:
--8<-- snip --8<--
Job: [id].torque-srv.cluster.local
04/13/2006 14:07:44 S enqueuing into default, state 1 hop 1
04/13/2006 14:07:44 S Job Queued at request of
user at torque-srv.cluster.local, owner = user at torque-srv.cluster.local,
job name = hostname, queue = default
04/13/2006 14:07:44 S Job Modified at request of
Scheduler at torque-srv.cluster.local
04/13/2006 14:07:44 S Job Run at request of
Scheduler at torque-srv.cluster.local
04/13/2006 14:07:44 A queue=default
04/13/2006 14:07:45 L Job Run
04/13/2006 14:07:45 A user=jdoe group=users jobname=hostname
queue=default ctime=1144930064 qtime=1144930064 etime=1144930064
start=1144930065 exec_host=torque-mom01
04/13/2006 14:07:46 S Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:02
04/13/2006 14:07:46 A user=jdoe group=users jobname=hostname
queue=default ctime=1144930064 qtime=1144930064 etime=1144930064
start=1144930065 exec_host=torque-mom01 session=25636 end=1144930066
Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:02
04/13/2006 14:07:50 S Post job file processing error
04/13/2006 14:07:50 S dequeuing from default, state COMPLETE
-->8-- snap -->8--
The /home filesystem is an NFS export available on the server node
as well as on all MOMs. User management is done in a unified way via LDAP.
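To rule out a plain permission or mount problem, a check along these lines can be run on a MOM node as the job owner. This is only a minimal sketch: check_writable is a hypothetical helper, and it only tests whether a file can be created in the given directory, nothing TORQUE-specific.

```shell
#!/bin/sh
# check_writable is a hypothetical helper: it tries to create and remove
# a temporary file in the given directory and reports the result.
check_writable() {
    dir=$1
    tmp=$(mktemp "$dir/.torque-write-test.XXXXXX" 2>/dev/null) || {
        echo "NOT writable: $dir"
        return 1
    }
    rm -f "$tmp"
    echo "writable: $dir"
}

# Example (run on torque-mom01 as the job owner):
#   check_writable /home/jdoe
```

If this reports the directory as writable on the MOM, the NFS export and the LDAP user mapping are probably not the cause of the failed copy.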
I have no idea why copying stdout/stderr back does not work. Using
-k (i.e. keeping the output files on the execution host) works perfectly,
but that is not an option for me. Any ideas?
Thanks in advance,
--
Dipl.-Inform. Alexander Papaspyrou | 44221 Dortmund, NRW (Germany)
Robotics Research Institute | phone : +49(231)755-5058
Information Technology Section | fax : +49(231)755-3251
University of Dortmund | web : http://www.irf.de/