[torqueusers] Post job file processing bug in 1.1.0 (patch 1-3)

Dave Jackson jacksond at supercluster.org
Thu Oct 21 11:04:14 MDT 2004


Roy,

  The reply code 15035 indicates that an invalid home directory was
specified when the mom was attempting to fork to the user.  Torque
1.1.0p3 and higher fixed a bug where the mom would attempt to determine
the home directory based on a NULL job.  This is the only change which
would have affected it.  It appears that this bug masked another.

  The latest torque-1.1.0p4 snapshot contains a re-organized mom-level
fork routine which will log environment errors better and will also
sanity check the home directory.  If you can regularly reproduce this
failure, please test with the 1.1.0p4 snapshot and send us the logs. 
You should only need to upgrade the moms on the nodes where the test job
is being run.  For maximum value, please export the env variable
PBSLOGLEVEL=3 on the compute nodes before starting the mom.

  With this info, we should be able to rectify this problem quickly.
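
  As a rough sketch only (this is not part of Torque), a short Python
script along these lines could pull the reject records out of a mom log
before sending it.  The semicolon-separated field layout is assumed from
the log excerpt quoted below; the function name find_rejects and the
sample lines are made up for illustration.

```python
import re

# Assumed record layout, inferred from the quoted logs:
#   date time;event-code;daemon;class;object;message
REJECT_RE = re.compile(r"Reject reply code=(\d+)")


def find_rejects(lines):
    """Return (timestamp, object, reply_code) for each reject record."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # wrapped continuation line or unrelated text
        timestamp, _event, _daemon, _cls, obj, message = fields
        match = REJECT_RE.search(message)
        if match:
            hits.append((timestamp, obj, int(match.group(1))))
    return hits


# Hypothetical sample input, shaped like the mom log quoted below.
sample = [
    "10/20/2004 23:46:19;0080;   pbs_mom;Job;6.paiute.cc.uit.no;Obit sent",
    "10/20/2004 23:46:19;0080;   pbs_mom;Req;req_reject;Reject reply code=15035",
]
print(find_rejects(sample))
```

  Lines that were wrapped by a mail client will not have all six fields
and are skipped, so this is best run against the log file itself rather
than a pasted copy.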

Thanks,
Dave

On Wed, 2004-10-20 at 16:18, Roy Dragseth wrote:
> When I upgraded from 1.0.1p6 to 1.1.0p3 I got problems with the delivery of 
> the stdout and stderr files for my jobs.  The jobs run but I don't get the 
> files containing stdout and stderr.  I get a mail from pbs saying this:
> 
> To: royd at paiute.cc.uit.no
> Subject: PBS JOB 6.paiute.cc.uit.no
> Date: Wed, 20 Oct 2004 23:46:19 +0200 (CEST)
> From: adm at paiute.cc.uit.no (root)
> 
> PBS Job Id: 6.paiute.cc.uit.no
> Job Name:   job.sh
> File stage in failed, see below.
> Job will be retried later, please investigate and correct problem.
> Post job file processing error; job 6.paiute.cc.uit.no on host 
> compute-0-0.local/0 REJHOST=compute-0-0.local
> 
> 
> Everything works fine with v1.0.1p6 and the same setup, but I would like to 
> get the latest version into the pbs-roll for rocks as it hopefully fixes the 
> nasty job startup looping bug that occurs when one restarts one of the moms 
> in the cluster.  One of my users got 30,000 emails saying his job was 
> started...
> 
> System info:
> Distro: Rocks Cluster Distribution v3.3.0 (a clone of RH EL 3.0)
> [royd at paiute royd]$ uname -a
> Linux paiute.cc.uit.no 2.4.21-20.EL #1 Wed Sep 8 17:45:16 GMT 2004 i686 i686 
> i386 GNU/Linux
> [royd at paiute royd]$ gcc --version
> gcc (GCC) 3.2.3 20030502 (Red Hat Linux 3.2.3-42)
> 
> 
> 
> SERVER LOG:
> 
> 10/20/2004 23:46:18;0100;PBS_Server;Req;;Type authenticateuser request 
> received from royd at paiute.cc.uit.no, sock=11
> 10/20/2004 23:46:18;0100;PBS_Server;Req;;Type queuejob request received from 
> royd at paiute.cc.uit.no, sock=10
> 10/20/2004 23:46:18;0100;PBS_Server;Req;;Type jobscript request received from 
> royd at paiute.cc.uit.no, sock=10
> 10/20/2004 23:46:18;0100;PBS_Server;Req;;Type readytocommit request received 
> from royd at paiute.cc.uit.no, sock=10
> 10/20/2004 23:46:18;0100;PBS_Server;Req;;Type commit request received from 
> royd at paiute.cc.uit.no, sock=10
> 10/20/2004 23:46:18;0100;PBS_Server;Job;6.paiute.cc.uit.no;enqueuing into 
> default, state 1 hop 1
> 10/20/2004 23:46:18;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Queued at 
> request of royd at paiute.cc.uit.no, owner = royd at paiute.cc.uit.no, job name = 
> job.sh, queue = default
> 10/20/2004 23:46:18;0040;PBS_Server;Svr;paiute.cc.uit.no;Scheduler sent 
> command scheduler_first
> 10/20/2004 23:46:19;0100;PBS_Server;Req;;Type disconnect request received from 
> maui at paiute.cc.uit.no, sock=9
> 10/20/2004 23:46:19;0100;PBS_Server;Req;;Type statusqueue request received 
> from maui at paiute.cc.uit.no, sock=9
> 10/20/2004 23:46:19;0100;PBS_Server;Req;;Type statusjob request received from 
> maui at paiute.cc.uit.no, sock=9
> 10/20/2004 23:46:19;0100;PBS_Server;Req;;Type modifyjob request received from 
> maui at paiute.cc.uit.no, sock=9
> 10/20/2004 23:46:19;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Modified at 
> request of maui at paiute.cc.uit.no
> 10/20/2004 23:46:19;0100;PBS_Server;Req;;Type runjob request received from 
> maui at paiute.cc.uit.no, sock=9
> 10/20/2004 23:46:19;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Run at request 
> of maui at paiute.cc.uit.no
> 10/20/2004 23:46:19;0100;PBS_Server;Req;;Type modifyjob request received from 
> maui at paiute.cc.uit.no, sock=9
> 10/20/2004 23:46:19;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Modified at 
> request of maui at paiute.cc.uit.no
> 10/20/2004 23:46:19;0100;PBS_Server;Req;;Type movejobfile request received 
> from pbs_mom at compute-0-0.local, sock=10
> 10/20/2004 23:46:19;0010;PBS_Server;Job;6.paiute.cc.uit.no;Exit_status=0 
> resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb 
> resources_used.walltime=00:00:00
> 10/20/2004 23:46:19;000d;PBS_Server;Job;6.paiute.cc.uit.no;Post job file 
> processing error; job 6.paiute.cc.uit.no on host compute-0-0.local/0
> 10/20/2004 23:46:19;0100;PBS_Server;Job;6.paiute.cc.uit.no;dequeuing from 
> default, state 5
> 10/20/2004 23:46:19;0040;PBS_Server;Svr;paiute.cc.uit.no;Scheduler sent 
> command term
> 
> 
> MOM LOG:
> 
> 10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type queuejob request received from 
> PBS_Server at paiute.local, sock=10
> 10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type jobscript request received from 
> PBS_Server at paiute.local, sock=10
> 10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type readytocommit request received 
> from PBS_Server at paiute.local, sock=10
> 10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type commit request received from 
> PBS_Server at paiute.local, sock=10
> 10/20/2004 23:46:19;0008;   pbs_mom;Job;6.paiute.cc.uit.no;Started, pid = 
> 14730
> 10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type statusjob request received from 
> PBS_Server at paiute.local, sock=11
> 10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type modifyjob request received from 
> PBS_Server at paiute.local, sock=10
> 10/20/2004 23:46:19;0008;   pbs_mom;Job;6.paiute.cc.uit.no;Job Modified at 
> request of PBS_Server at paiute.local
> 10/20/2004 23:46:19;0080;   
> pbs_mom;Job;6.paiute.cc.uit.no;scan_for_terminated: task 1 terminated, sid 
> 14730
> 10/20/2004 23:46:19;0008;   pbs_mom;Job;6.paiute.cc.uit.no;Terminated
> 10/20/2004 23:46:19;0080;   pbs_mom;Job;6.paiute.cc.uit.no;Obit sent
> 10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type deletefiles request received 
> from PBS_Server at paiute.local, sock=11
> 10/20/2004 23:46:19;0080;   pbs_mom;Req;req_reject;Reject reply code=15035
> ( REJHOST=compute-0-0.local), aux=0, type=54, from PBS_Server at paiute.local
> 10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type deletejob request received from 
> PBS_Server at paiute.local, sock=11
> 
> 
> Any hints are greatly appreciated.
> 
> Best regards,
> r.
> 


