[torqueusers] Post Job File Processing Error (using: TORQUE v2.0.0p8 from ATrpms under Fedora Core 3)

Alexander Papaspyrou alexander.papaspyrou at udo.edu
Thu Apr 13 06:56:32 MDT 2006


Hi all,

After a fresh installation and configuration of TORQUE, I keep
getting the following error message for a simple test job:

--8<-- snip --8<--
PBS Job Id: [id].torque-srv.cluster.local
Job Name:   hostname
An error has occurred processing your job, see below.
Post job file processing error; job [id].torque-srv.cluster.local on
host torque-mom01

Unable to copy file /var/spool/pbs/spool/[id].torque-s.OU to
/home/jdoe/hostname.out
-->8-- snap -->8--

The test job looks like

--8<-- snip --8<--
#PBS -o torque-srv:/home/jdoe/hostname.out
#PBS -e torque-srv:/home/jdoe/hostname.err
/bin/hostname
-->8-- snap -->8--

The MOM configuration file on all "torque-mom[num]" hosts looks like this:

--8<-- snip --8<--
$clienthost torque-srv.cluster.local
$usecp *:/home /home
-->8-- snap -->8--
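
To make clear what I expect the $usecp line to do, here is a rough shell
sketch of the intended mapping. It is only an illustration of my
understanding, not TORQUE's actual matching code; map_usecp is a made-up
helper, and treating the host part as a glob and the path part as a plain
prefix is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): apply one rule of the form
#   $usecp <host_pattern>:<src_prefix> <dst_prefix>
# to a destination such as "torque-srv:/home/jdoe/hostname.out".
# Assumption: host is matched as a glob, path as a simple prefix.
# Variables are global because plain POSIX sh has no "local".
map_usecp() {
    rule_host=$1; rule_src=$2; rule_dst=$3; dest=$4
    host=${dest%%:*}    # text before the first colon
    path=${dest#*:}     # text after the first colon

    case $host in
        $rule_host) ;;      # unquoted on purpose: "*" matches any host
        *) return 1 ;;      # host does not match, rule not applicable
    esac

    case $path in
        "$rule_src"*)
            rest=${path#"$rule_src"}        # strip the source prefix
            printf '%s\n' "$rule_dst$rest"  # local path to use with cp
            ;;
        *) return 1 ;;      # path is outside the rule's prefix
    esac
}

# The rule from my MOM config applied to the stdout destination of the job:
map_usecp '*' /home /home torque-srv:/home/jdoe/hostname.out
# prints /home/jdoe/hostname.out
```

So, as far as I understand it, the MOM should end up doing a local copy
into /home/jdoe instead of a remote copy to torque-srv.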

A sample tracejob output looks like this:

--8<-- snip --8<--
Job: [id].torque-srv.cluster.local

04/13/2006 14:07:44  S    enqueuing into default, state 1 hop 1
04/13/2006 14:07:44  S    Job Queued at request of
user at torque-srv.cluster.local, owner = user at torque-srv.cluster.local,
job name = hostname, queue = default
04/13/2006 14:07:44  S    Job Modified at request of
Scheduler at torque-srv.cluster.local
04/13/2006 14:07:44  S    Job Run at request of
Scheduler at torque-srv.cluster.local
04/13/2006 14:07:44  A    queue=default
04/13/2006 14:07:45  L    Job Run
04/13/2006 14:07:45  A    user=jdoe group=users jobname=hostname
queue=default ctime=1144930064 qtime=1144930064 etime=1144930064
start=1144930065 exec_host=torque-mom01
04/13/2006 14:07:46  S    Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:02
04/13/2006 14:07:46  A    user=jdoe group=users jobname=hostname
queue=default ctime=1144930064 qtime=1144930064 etime=1144930064
start=1144930065 exec_host=torque-mom01 session=25636 end=1144930066
Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:02
04/13/2006 14:07:50  S    Post job file processing error
04/13/2006 14:07:50  S    dequeuing from default, state COMPLETE
-->8-- snap -->8--

The /home filesystem is an NFS export available on both the server node
and all MOMs. User management is done in a unified way via LDAP.
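
A quick way to double-check the NFS side from a MOM host could look like
the sketch below, run as the job owner. check_writable is a made-up helper
name, and the probe file name is arbitrary; it only tests whether a file
can be created in the given directory and says nothing about how TORQUE
itself performs the copy.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the current user can create (and then
# remove) a file in the given directory, fail otherwise.
check_writable() {
    dir=$1
    # mktemp fails if the directory is missing or not writable
    probe=$(mktemp "$dir/.probe.XXXXXX" 2>/dev/null) || return 1
    rm -f "$probe"
}

# Check the directory given as first argument, or $HOME by default.
if check_writable "${1:-$HOME}"; then
    echo "writable"
else
    echo "NOT writable"
fi
```

Running this as jdoe on torque-mom01 against /home/jdoe would show whether
the export is mounted read-write there and whether the LDAP uid matches.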

I have absolutely no clue why std[out|err] copying does not work. Using
-k (aka keeping stuff on the execution host) works perfectly, but that's
not an option for me. Any ideas?

Thanks in advance,

-- 
Dipl.-Inform. Alexander Papaspyrou      | 44221 Dortmund, NRW (Germany)
Robotics Research Institute             | phone  : +49(231)755-5058
Information Technology Section          | fax    : +49(231)755-3251
University of Dortmund                  | web    : http://www.irf.de/
