[torqueusers] Post Job File Processing Error (using: TORQUE v2.0.0p8 from ATrpms under Fedora Core 3)

Garrick Staples garrick at usc.edu
Thu Apr 13 13:01:40 MDT 2006


On Thu, Apr 13, 2006 at 02:56:32PM +0200, Alexander Papaspyrou alleged:
> Hi all,
> 
> after doing a fresh installation and configuration of TORQUE, I keep
> getting the following error message for a simple test job:
> 
> --8<-- snip --8<--
> PBS Job Id: [id].torque-srv.cluster.local
> Job Name:   hostname
> An error has occurred processing your job, see below.
> Post job file processing error; job [id].torque-srv.cluster.local on
> host torque-mom01
> 
> Unable to copy file /var/spool/pbs/spool/[id].torque-s.OU to
> /home/jdoe/hostname.out
> -->8-- snap -->8--
> 
> The test job looks like
> 
> --8<-- snip --8<--
> #PBS -o torque-srv:/home/jdoe/hostname.out
> #PBS -o torque-srv:/home/jdoe/hostname.err
> /bin/hostname
> -->8-- snap -->8--

FYI, you have -o twice.
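Presumably the second directive was meant to be `-e` for stderr; a corrected version of the test job (same paths as in the original post) would read:

```shell
#PBS -o torque-srv:/home/jdoe/hostname.out
#PBS -e torque-srv:/home/jdoe/hostname.err
/bin/hostname
```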

 
> The MOM configuration file looks for all "torque-mom[num]" hosts like this:
> 
> --8<-- snip --8<--
> $clienthost torque-srv.cluster.local
> $usecp *:/home /home
> -->8-- snap -->8--

That looks fine.
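For reference, `$usecp` tells pbs_mom which destination path prefixes are locally mounted, so it can cp(1) the spooled output file directly instead of shelling out to rcp/scp. A minimal sketch of the MOM config for this setup (file location assumed to be mom_priv/config under the pbs spool directory):

```
# mom_priv/config on each torque-mom[num] host
$clienthost torque-srv.cluster.local
# map any host's /home onto the locally-mounted NFS /home
$usecp *:/home /home
```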

 
> A sample tracejob output looks like this:
> 
> --8<-- snip --8<--
> Job: [id].torque-srv.cluster.local
> 
> 04/13/2006 14:07:44  S    enqueuing into default, state 1 hop 1
> 04/13/2006 14:07:44  S    Job Queued at request of
> user at torque-srv.cluster.local, owner = user at torque-srv.cluster.local,
> job name = hostname, queue = default
> 04/13/2006 14:07:44  S    Job Modified at request of
> Scheduler at torque-srv.cluster.local
> 04/13/2006 14:07:44  S    Job Run at request of
> Scheduler at torque-srv.cluster.local
> 04/13/2006 14:07:44  A    queue=default
> 04/13/2006 14:07:45  L    Job Run
> 04/13/2006 14:07:45  A    user=jdoe group=users jobname=hostname
> queue=default ctime=1144930064 qtime=1144930064 etime=1144930064
> start=1144930065 exec_host=torque-mom01
> 04/13/2006 14:07:46  S    Exit_status=0 resources_used.cput=00:00:00
> resources_used.mem=0kb resources_used.vmem=0kb
> resources_used.walltime=00:00:02
> 04/13/2006 14:07:46  A    user=jdoe group=users jobname=hostname
> queue=default ctime=1144930064 qtime=1144930064 etime=1144930064
> start=1144930065 exec_host=torque-mom01 session=25636 end=1144930066
> Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
> resources_used.vmem=0kb resources_used.walltime=00:00:02
> 04/13/2006 14:07:50  S    Post job file processing error
> 04/13/2006 14:07:50  S    dequeuing from default, state COMPLETE
> -->8-- snap -->8--
> 
> The /home filesystem is a NFS export available on both the server node
> as well as on all MOMs. User management is done in a unified way via LDAP.
> 
> I have absolutely no clue why std[out|err] copying does not work. Using
> -k (aka keeping stuff on the execution host) works perfectly, but that's
> not an option for me. Any ideas?

Check syslog on the execution node (torque-mom01); the actual error from the copy attempt is usually logged there.
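In case it helps, a couple of places to look on torque-mom01. The exact log locations are assumptions and vary by distro/packaging; Fedora Core 3 typically logs daemons to /var/log/messages, and pbs_mom writes a daily log under the spool directory:

```shell
# Grep the system log on the execution host for pbs_mom copy errors
grep -i pbs_mom /var/log/messages | tail -n 20

# pbs_mom also writes a daily log file named YYYYMMDD under mom_logs
tail -n 50 /var/spool/pbs/mom_logs/$(date +%Y%m%d)
```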

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California

