[torqueusers] Post job file processing bug in 1.1.0 (patch 1-3)
Roy Dragseth
Roy.Dragseth at cc.uit.no
Wed Oct 20 16:18:51 MDT 2004
When I upgraded from 1.0.1p6 to 1.1.0p3 I got problems with the delivery of
the stdout and stderr files for my jobs. The jobs run, but I don't get the
files containing stdout and stderr. I get a mail from pbs saying this:
To: royd at paiute.cc.uit.no
Subject: PBS JOB 6.paiute.cc.uit.no
Date: Wed, 20 Oct 2004 23:46:19 +0200 (CEST)
From: adm at paiute.cc.uit.no (root)
PBS Job Id: 6.paiute.cc.uit.no
Job Name: job.sh
File stage in failed, see below.
Job will be retried later, please investigate and correct problem.
Post job file processing error; job 6.paiute.cc.uit.no on host
compute-0-0.local/0 REJHOST=compute-0-0.local
Everything works fine with v1.0.1p6 and the same setup, but I would like to
get the latest version into the pbs-roll for Rocks, as it hopefully fixes the
nasty job startup looping bug that occurs when one restarts one of the moms
in the cluster. One of my users got 30,000 emails saying his job was
started...
System info:
Distro: Rocks Cluster Distribution v3.3.0 (a clone of RH EL 3.0)
[royd at paiute royd]$ uname -a
Linux paiute.cc.uit.no 2.4.21-20.EL #1 Wed Sep 8 17:45:16 GMT 2004 i686 i686
i386 GNU/Linux
[royd at paiute royd]$ gcc --version
gcc (GCC) 3.2.3 20030502 (Red Hat Linux 3.2.3-42)
SERVER LOG:
10/20/2004 23:46:18;0100;PBS_Server;Req;;Type authenticateuser request
received from royd at paiute.cc.uit.no, sock=11
10/20/2004 23:46:18;0100;PBS_Server;Req;;Type queuejob request received from
royd at paiute.cc.uit.no, sock=10
10/20/2004 23:46:18;0100;PBS_Server;Req;;Type jobscript request received from
royd at paiute.cc.uit.no, sock=10
10/20/2004 23:46:18;0100;PBS_Server;Req;;Type readytocommit request received
from royd at paiute.cc.uit.no, sock=10
10/20/2004 23:46:18;0100;PBS_Server;Req;;Type commit request received from
royd at paiute.cc.uit.no, sock=10
10/20/2004 23:46:18;0100;PBS_Server;Job;6.paiute.cc.uit.no;enqueuing into
default, state 1 hop 1
10/20/2004 23:46:18;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Queued at
request of royd at paiute.cc.uit.no, owner = royd at paiute.cc.uit.no, job name =
job.sh, queue = default
10/20/2004 23:46:18;0040;PBS_Server;Svr;paiute.cc.uit.no;Scheduler sent
command scheduler_first
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type disconnect request received from
maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type statusqueue request received
from maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type statusjob request received from
maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type modifyjob request received from
maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Modified at
request of maui at paiute.cc.uit.no
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type runjob request received from
maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Run at request
of maui at paiute.cc.uit.no
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type modifyjob request received from
maui at paiute.cc.uit.no, sock=9
10/20/2004 23:46:19;0008;PBS_Server;Job;6.paiute.cc.uit.no;Job Modified at
request of maui at paiute.cc.uit.no
10/20/2004 23:46:19;0100;PBS_Server;Req;;Type movejobfile request received
from pbs_mom at compute-0-0.local, sock=10
10/20/2004 23:46:19;0010;PBS_Server;Job;6.paiute.cc.uit.no;Exit_status=0
resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:00
10/20/2004 23:46:19;000d;PBS_Server;Job;6.paiute.cc.uit.no;Post job file
processing error; job 6.paiute.cc.uit.no on host compute-0-0.local/0
10/20/2004 23:46:19;0100;PBS_Server;Job;6.paiute.cc.uit.no;dequeuing from
default, state 5
10/20/2004 23:46:19;0040;PBS_Server;Svr;paiute.cc.uit.no;Scheduler sent
command term
MOM LOG:
10/20/2004 23:46:19;0100; pbs_mom;Req;;Type queuejob request received from
PBS_Server at paiute.local, sock=10
10/20/2004 23:46:19;0100; pbs_mom;Req;;Type jobscript request received from
PBS_Server at paiute.local, sock=10
10/20/2004 23:46:19;0100; pbs_mom;Req;;Type readytocommit request received
from PBS_Server at paiute.local, sock=10
10/20/2004 23:46:19;0100; pbs_mom;Req;;Type commit request received from
PBS_Server at paiute.local, sock=10
10/20/2004 23:46:19;0008; pbs_mom;Job;6.paiute.cc.uit.no;Started, pid =
14730
10/20/2004 23:46:19;0100; pbs_mom;Req;;Type statusjob request received from
PBS_Server at paiute.local, sock=11
10/20/2004 23:46:19;0100; pbs_mom;Req;;Type modifyjob request received from
PBS_Server at paiute.local, sock=10
10/20/2004 23:46:19;0008; pbs_mom;Job;6.paiute.cc.uit.no;Job Modified at
request of PBS_Server at paiute.local
10/20/2004 23:46:19;0080;
pbs_mom;Job;6.paiute.cc.uit.no;scan_for_terminated: task 1 terminated, sid
14730
10/20/2004 23:46:19;0008; pbs_mom;Job;6.paiute.cc.uit.no;Terminated
10/20/2004 23:46:19;0080; pbs_mom;Job;6.paiute.cc.uit.no;Obit sent
10/20/2004 23:46:19;0100; pbs_mom;Req;;Type deletefiles request received
from PBS_Server at paiute.local, sock=11
10/20/2004 23:46:19;0080; pbs_mom;Req;req_reject;Reject reply code=15035
( REJHOST=compute-0-0.local), aux=0, type=54, from PBS_Server at paiute.local
10/20/2004 23:46:19;0100; pbs_mom;Req;;Type deletejob request received from
PBS_Server at paiute.local, sock=11
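To make it easier to line up the two logs above, here is a minimal Python sketch that rejoins the wrapped lines and splits each entry into fields. The field layout (timestamp;hex event code;daemon;class;object;message) is only inferred from the excerpts in this mail, not taken from TORQUE documentation, and the names parse_pbs_log and find_problems are hypothetical helpers made up for illustration.

```python
import re

# An entry starts with a timestamp such as "10/20/2004 23:46:19;".
# Anything else is treated as a continuation of a mail-wrapped entry.
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")


def parse_pbs_log(text):
    """Rejoin wrapped lines and split each entry into six fields.

    Assumed layout (inferred from the excerpts, not from a spec):
    timestamp;event code (hex);daemon;class;object;message
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ENTRY_START.match(line):
            entries.append(line)
        elif entries:
            # Continuation of the previous, wrapped entry.
            entries[-1] += " " + line

    parsed = []
    for entry in entries:
        # maxsplit=5 keeps any ';' inside the message text intact.
        parts = [p.strip() for p in entry.split(";", 5)]
        if len(parts) != 6:
            continue  # not in the assumed layout, skip it
        stamp, code, daemon, cls, obj, msg = parts
        parsed.append({
            "time": stamp,
            "code": int(code, 16),
            "daemon": daemon,
            "class": cls,
            "object": obj,
            "message": msg,
        })
    return parsed


def find_problems(parsed):
    """Return entries whose message mentions a reject or an error."""
    return [e for e in parsed
            if re.search(r"reject|error", e["message"], re.IGNORECASE)]
```

Running find_problems(parse_pbs_log(text)) over the server and MOM excerpts should pick out the "Post job file processing error" entry on the server side and the "Reject reply code=15035" entry on the MOM side, which are the two lines that look relevant here.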
Any hints are greatly appreciated.
Best regards,
r.
--
The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, High Performance Computing System Administrator
Direct call: +47 77 64 62 56. email: royd at cc.uit.no