[torqueusers] Post Job File Processing Error (using: TORQUE
v2.0.0p8 from ATrpms under Fedora Core 3)
Garrick Staples
garrick at usc.edu
Thu Apr 13 13:01:40 MDT 2006
On Thu, Apr 13, 2006 at 02:56:32PM +0200, Alexander Papaspyrou alleged:
> Hi all,
>
> after doing a fresh installation and configuration of TORQUE, I keep
> getting the following error message for a simple test job:
>
> --8<-- snip --8<--
> PBS Job Id: [id].torque-srv.cluster.local
> Job Name: hostname
> An error has occurred processing your job, see below.
> Post job file processing error; job [id].torque-srv.cluster.local on
> host torque-mom01
>
> Unable to copy file /var/spool/pbs/spool/[id].torque-s.OU to
> /home/jdoe/hostname.out
> -->8-- snap -->8--
>
> The test job looks like
>
> --8<-- snip --8<--
> #PBS -o torque-srv:/home/jdoe/hostname.out
> #PBS -o torque-srv:/home/jdoe/hostname.err
> /bin/hostname
> -->8-- snap -->8--
FYI, you have -o twice. The second line points at hostname.err, so it
presumably should be -e.
> The MOM configuration file looks for all "torque-mom[num]" hosts like this:
>
> --8<-- snip --8<--
> $clienthost torque-srv.cluster.local
> $usecp *:/home /home
> -->8-- snap -->8--
That looks fine.
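For what it's worth, that $usecp line tells the MOM that any destination of
the form host:/home/... (any host, because of the *) is reachable locally
under /home, so it can use a plain cp instead of a remote copy. Here is a
rough shell sketch of that mapping. It is only an illustration of the idea,
not pbs_mom's actual code, and the function name map_usecp is made up:

```shell
# Simplified illustration of how a $usecp rule maps a destination.
# NOT pbs_mom's real implementation; map_usecp is a hypothetical helper.
# usage: map_usecp HOST_PATTERN REMOTE_PREFIX LOCAL_PREFIX DEST
# Prints the local path to cp to, or returns 1 if the rule does not apply
# (in which case a remote copy would be needed instead).
map_usecp() {
    pattern=$1
    remote_prefix=$2
    local_prefix=$3
    dest=$4

    host=${dest%%:*}    # part before the first colon
    path=${dest#*:}     # part after the first colon

    case $host in
        $pattern) ;;    # host matches the wildcard pattern
        *) return 1 ;;
    esac

    case $path in
        "$remote_prefix"/*)
            # swap the remote prefix for the local one
            printf '%s\n' "$local_prefix${path#"$remote_prefix"}" ;;
        *) return 1 ;;
    esac
}

# The rule from your config, applied to the -o destination from your script:
map_usecp '*' /home /home torque-srv:/home/jdoe/hostname.out
```

With your rule the destination maps straight back to /home/jdoe/hostname.out
on the node, so the copy should be a local cp run as the job owner.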
> A sample tracejob output looks like this:
>
> --8<-- snip --8<--
> Job: [id].torque-srv.cluster.local
>
> 04/13/2006 14:07:44 S enqueuing into default, state 1 hop 1
> 04/13/2006 14:07:44 S Job Queued at request of
> user@torque-srv.cluster.local, owner = user@torque-srv.cluster.local,
> job name = hostname, queue = default
> 04/13/2006 14:07:44 S Job Modified at request of
> Scheduler@torque-srv.cluster.local
> 04/13/2006 14:07:44 S Job Run at request of
> Scheduler@torque-srv.cluster.local
> 04/13/2006 14:07:44 A queue=default
> 04/13/2006 14:07:45 L Job Run
> 04/13/2006 14:07:45 A user=jdoe group=users jobname=hostname
> queue=default ctime=1144930064 qtime=1144930064 etime=1144930064
> start=1144930065 exec_host=torque-mom01
> 04/13/2006 14:07:46 S Exit_status=0 resources_used.cput=00:00:00
> resources_used.mem=0kb resources_used.vmem=0kb
> resources_used.walltime=00:00:02
> 04/13/2006 14:07:46 A user=jdoe group=users jobname=hostname
> queue=default ctime=1144930064 qtime=1144930064 etime=1144930064
> start=1144930065 exec_host=torque-mom01 session=25636 end=1144930066
> Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
> resources_used.vmem=0kb resources_used.walltime=00:00:02
> 04/13/2006 14:07:50 S Post job file processing error
> 04/13/2006 14:07:50 S dequeuing from default, state COMPLETE
> -->8-- snap -->8--
>
> The /home filesystem is an NFS export available on both the server node
> and all MOMs. User management is done in a unified way via LDAP.
>
> I have absolutely no clue why std[out|err] copying does not work. Using
> -k (aka keeping stuff on the execution host) works perfectly, but that's
> not an option for me. Any ideas?
Check syslog on the node (torque-mom01). The MOM should log the reason the
copy failed there.
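Something along these lines would pull the MOM's recent messages out of
syslog. This is a sketch under two assumptions: that syslog on the node
writes to /var/log/messages, and that the MOM's entries contain the string
pbs_mom. Adjust both if your setup differs. The function name mom_syslog is
made up:

```shell
# Show the most recent syslog lines mentioning pbs_mom.
# usage: mom_syslog [LOGFILE] [COUNT]
# LOGFILE defaults to /var/log/messages (an assumption; syslog.conf on
# your node may send daemon messages somewhere else).
# COUNT defaults to 20 lines.
mom_syslog() {
    logfile=${1:-/var/log/messages}
    count=${2:-20}
    grep 'pbs_mom' "$logfile" | tail -n "$count"
}
```

Run it as root on torque-mom01 right after a failing job, e.g. "mom_syslog"
or "mom_syslog /var/log/messages 50", and look at the entries around the
time of the post job file processing error.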
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California