[torqueusers] Post job file processing bug in 1.1.0 (patch 1-3)

Roy Dragseth Roy.Dragseth at cc.uit.no
Wed Oct 20 16:18:51 MDT 2004


When I upgraded from 1.0.1p6 to 1.1.0p3 I got problems with the delivery of 
the stdout and stderr files for my jobs.  The jobs run, but I don't get the 
files containing stdout and stderr.  Instead I get a mail from pbs saying this:

To: royd at paiute.cc.uit.no
Subject: PBS JOB 6.paiute.cc.uit.no
Date: Wed, 20 Oct 2004 23:46:19 +0200 (CEST)
From: adm at paiute.cc.uit.no (root)

PBS Job Id: 6.paiute.cc.uit.no
Job Name:   job.sh
File stage in failed, see below.
Job will be retried later, please investigate and correct problem.
Post job file processing error; job 6.paiute.cc.uit.no on host 
compute-0-0.local/0 REJHOST=compute-0-0.local


Everything works fine with v1.0.1p6 and the same setup, but I would like to 
get the latest version into the pbs-roll for Rocks, as it hopefully fixes the 
nasty job startup looping bug that occurs when one restarts one of the moms 
in the cluster.  One of my users got 30,000 emails saying his job was 
started...

System info:
Distro: Rocks Cluster Distribution v3.3.0 (a clone of RH EL 3.0)
[royd at paiute royd]$ uname -a
Linux paiute.cc.uit.no 2.4.21-20.EL #1 Wed Sep 8 17:45:16 GMT 2004 i686 i686 
i386 GNU/Linux
[royd at paiute royd]$ gcc --version
gcc (GCC) 3.2.3 20030502 (Red Hat Linux 3.2.3-42)



SERVER LOG:

10/20/2004 23:46:18;0100;PBS_Server;Req;;Type authenticateuser request 
received from royd at paiute.cc.uit.no, sock=11
10/20/2004 23:46:18;0100;PBS_Server;Req;;Type queuejob request received from 
royd at paiute.cc.uit.no, sock=10
10/20/2004 23:46:18;0100;PBS_Server;Req;;Type jobscript request received from 
royd at paiute.cc.uit.no, sock=10
10/20/2004 23:46:18;0100;PBS_Server;Req;;Type readytocommit request received 
from royd at paiute.cc.uit.no, sock=10
10/20/2004 23:46:18;0100;PBS_Server;Req;;Type commit request received from 
royd at paiute.cc.uit.no, sock=10
10/20/2004 23:46:18;0100;PBS_Server;Job;6.paiute.cc.uit.no;enqueuing into 
default, state 1 hop 1
10/20/2004 23:46:18;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Queued at 
request of royd at paiute.cc.uit.no, owner = royd at paiute.cc.uit.no, job name = 
job.sh, queue = default
10/20/2004 23:46:18;0040;PBS_Server;Svr;paiute.cc.uit.no;Scheduler sent 
command scheduler_first
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type disconnect request received from 
maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type statusqueue request received 
from maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type statusjob request received from 
maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type modifyjob request received from 
maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Modified at 
request of maui at paiute.cc.uit.no
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type runjob request received from 
maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Run at request 
of maui at paiute.cc.uit.no
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type modifyjob request received from 
maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Modified at 
request of maui at paiute.cc.uit.no
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type movejobfile request received 
from pbs_mom at compute-0-0.local, sock=10
10/20/2004 23:46:19;0010;PBS_Server;Job;6.paiute.cc.uit.no;Exit_status=0 
resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb 
resources_used.walltime=00:00:00
10/20/2004 23:46:19;000d;PBS_Server;Job;6.paiute.cc.uit.no;Post job file 
processing error; job 6.paiute.cc.uit.no on host compute-0-0.local/0
10/20/2004 23:46:19;0100;PBS_Server;Job;6.paiute.cc.uit.no;dequeuing from 
default, state 5
10/20/2004 23:46:19;0040;PBS_Server;Svr;paiute.cc.uit.no;Scheduler sent 
command term


MOM LOG:

10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type queuejob request received from 
PBS_Server at paiute.local, sock=10
10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type jobscript request received from 
PBS_Server at paiute.local, sock=10
10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type readytocommit request received 
from PBS_Server at paiute.local, sock=10
10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type commit request received from 
PBS_Server at paiute.local, sock=10
10/20/2004 23:46:19;0008;   pbs_mom;Job;6.paiute.cc.uit.no;Started, pid = 
14730
10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type statusjob request received from 
PBS_Server at paiute.local, sock=11
10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type modifyjob request received from 
PBS_Server at paiute.local, sock=10
10/20/2004 23:46:19;0008;   pbs_mom;Job;6.paiute.cc.uit.no;Job Modified at 
request of PBS_Server at paiute.local
10/20/2004 23:46:19;0080;   pbs_mom;Job;6.paiute.cc.uit.no;scan_for_terminated: 
task 1 terminated, sid 14730
10/20/2004 23:46:19;0008;   pbs_mom;Job;6.paiute.cc.uit.no;Terminated
10/20/2004 23:46:19;0080;   pbs_mom;Job;6.paiute.cc.uit.no;Obit sent
10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type deletefiles request received 
from PBS_Server at paiute.local, sock=11
10/20/2004 23:46:19;0080;   pbs_mom;Req;req_reject;Reject reply code=15035
( REJHOST=compute-0-0.local), aux=0, type=54, from PBS_Server at paiute.local
10/20/2004 23:46:19;0100;   pbs_mom;Req;;Type deletejob request received from 
PBS_Server at paiute.local, sock=11
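
Because the log excerpts above were wrapped by the mail client, picking out the rejected request by eye is error-prone. The following is a minimal sketch, not part of TORQUE, that rejoins wrapped records and extracts any "Reject reply code" values. It assumes the six-field, semicolon-separated layout seen in the excerpts (timestamp;event;daemon;class;object;message); the names LogRecord, join_wrapped, parse_record and reject_codes are made up for this illustration.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Tuple

# A record starts with "MM/DD/YYYY HH:MM:SS;" as in the excerpts above.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")
REJECT_CODE = re.compile(r"Reject reply code=(\d+)")


class LogRecord(NamedTuple):
    """Hypothetical container; field names are assumptions, not TORQUE terms."""
    timestamp: str
    event: str
    daemon: str
    kind: str
    obj: str
    message: str


def join_wrapped(lines: Iterable[str]) -> Iterator[str]:
    """Glue mail-wrapped continuation lines back onto their record."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None:
            # Continuation of the previous record (wrap happened at a space).
            current += " " + line
        # Text before the first record (e.g. a "MOM LOG:" heading) is skipped.
    if current is not None:
        yield current


def parse_record(text: str) -> LogRecord:
    """Split into six fields; the message itself may contain semicolons."""
    parts = [p.strip() for p in text.split(";", 5)]
    if len(parts) != 6:
        raise ValueError("unexpected record layout: %r" % text)
    return LogRecord(*parts)


def reject_codes(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Return (timestamp, daemon, code) for every reject reply found."""
    found = []
    for rec in map(parse_record, join_wrapped(lines)):
        match = REJECT_CODE.search(rec.message)
        if match:
            found.append((rec.timestamp, rec.daemon, int(match.group(1))))
    return found


# Two records copied from the MOM log above, wrapped as they arrived by mail.
SAMPLE = """\
10/20/2004 23:46:19;0080;
pbs_mom;Job;6.paiute.cc.uit.no;scan_for_terminated: task 1 terminated, sid
14730
10/20/2004 23:46:19;0080;   pbs_mom;Req;req_reject;Reject reply code=15035
( REJHOST=compute-0-0.local), aux=0, type=54, from PBS_Server at paiute.local
""".splitlines()

if __name__ == "__main__":
    for hit in reject_codes(SAMPLE):
        print(hit)
```

Splitting with a limit of five separators keeps messages such as "Post job file processing error; job ..." in one piece. The sketch only locates the numeric code; it does not attempt to interpret it, and any non-log text that follows the last record would be appended to that record.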


Any hints are greatly appreciated.

Best regards,
r.


-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
     Roy Dragseth, High Performance Computing System Administrator
	 Direct call: +47 77 64 62 56. email: royd at cc.uit.no


