[torqueusers] unable to prevent PBS_Server <-> PBS_MOM race

Michael Gutteridge mgutteri at fhcrc.org
Mon Sep 13 20:42:30 MDT 2004


Heya-

I recently started getting this problem, or at least a similar one.
I'm using Maui 3.2.6p9 and torque 1.1.0p0 (recently upgraded).  The good
news is I think I have a workaround that's just slightly easier than
rebooting the cluster 8-), the bad news is I don't know what the core
problem is.

In my case, I just needed to "qrerun" the job (make sure rerunnable is 
true!).  You might have to do it a few times to catch it when it's in a 
"run" state.  That seems to get it.
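
If you'd rather not sit there retyping it, the retry can be scripted.
Here's a rough Python sketch.  Assumptions on my part: qalter and qrerun
are on the PATH, and qrerun exits nonzero when the job isn't in a state
it will accept (I haven't verified that on this version).  The function
name, attempt count and delay are made up for illustration.

```python
import subprocess
import time


def rerun_until_accepted(job_id, attempts=20, delay=1.0,
                         run=subprocess.call, sleep=time.sleep):
    """Call qrerun on job_id until it exits 0 or the attempts run out.

    Returns the 1-based attempt number that succeeded, or None.
    `run` and `sleep` are injectable so the loop can be tried offline.
    """
    # Mark the job rerunnable first (qalter -r y).
    run(["qalter", "-r", "y", job_id])
    for attempt in range(1, attempts + 1):
        # Assumption: exit status 0 means the server accepted the rerun.
        if run(["qrerun", job_id]) == 0:
            return attempt
        sleep(delay)
    return None
```

Something like rerun_until_accepted("12200.pbsserv") would be the call
for the job below; None back means it never caught the job in a state
qrerun liked.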

As for the core problem: I've got logging set to insane (255).  These
are the messages that popped up when this particular problem job was run:

09/13/2004 15:21:16;0008;PBS_Server;Job;12200.pbsserv;Dependency on job 12199.pbsserv released.
09/13/2004 15:21:16;0008;PBS_Server;Job;12201.pbsserv;Dependency on job 12199.pbsserv released.
09/13/2004 15:21:16;0040;PBS_Server;Svr;pbsserv;Scheduler sent command new
09/13/2004 15:21:17;0008;PBS_Server;Job;12200.pbsserv;Job Modified at request of root at pbsserv
09/13/2004 15:21:17;0008;PBS_Server;Job;12200.pbsserv;Job Run at request of root at pbsserv
09/13/2004 15:21:17;0040;PBS_Server;Svr;pbsserv;Scheduler sent command term
09/13/2004 15:21:17;0008;PBS_Server;Job;12200.pbsserv;Job Modified at request of root at pbsserv
09/13/2004 15:21:19;0040;PBS_Server;Svr;pbsserv;Scheduler sent command new
09/13/2004 15:21:19;0080;PBS_Server;Req;?;Req Header bad, dis error 7
09/13/2004 15:21:19;0080;PBS_Server;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @
09/13/2004 15:21:19;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
09/13/2004 15:21:19;0009;PBS_Server;Job;12200.pbsserv;Job Obit notice received from node10. has error 15016
09/13/2004 15:21:19;0080;PBS_Server;Req;req_reject;Reject reply code=15016, aux=0, type=56, from pbs_mom at node10.
09/13/2004 15:21:19;0009;PBS_Server;Job;12200.pbsserv;Job Obit notice received from node10. has error 15016
09/13/2004 15:21:19;0080;PBS_Server;Req;req_reject;Reject reply code=15016, aux=0, type=56, from pbs_mom at node10.
09/13/2004 15:21:20;0008;PBS_Server;Job;12200.pbsserv;Job Modified at request of root at pbsserv
09/13/2004 15:21:20;0008;PBS_Server;Job;12200.pbsserv;Job Run at request of root at pbsserv
09/13/2004 15:21:20;0008;PBS_Server;Job;12200.pbsserv;Job Modified at request of root at pbsserv
09/13/2004 15:21:23;0040;PBS_Server;Svr;pbsserv;Scheduler sent command new
09/13/2004 15:21:23;0009;PBS_Server;Job;12200.pbsserv;Job Obit notice received from node10. has error 15016
09/13/2004 15:21:23;0080;PBS_Server;Req;req_reject;Reject reply code=15016, aux=0, type=56, from pbs_mom at node10.
09/13/2004 15:21:23;0009;PBS_Server;Job;12200.pbsserv;Job Obit notice received from node10. has error 15016
09/13/2004 15:21:23;0080;PBS_Server;Req;req_reject;Reject reply code=15016, aux=0, type=56, from pbs_mom at node10.
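
If anyone wants to tally these rather than eyeball them, here's a rough
Python sketch that splits server_log records into their
semicolon-separated fields and counts the 150xx codes.  The field names
(time, event, daemon, class, name, msg) are my own labels, not anything
official, and the two code meanings are just the ones Daniel gives in
his message quoted below.  It also glues mail-wrapped lines back onto
their record with a space, so it won't repair lines that were
hard-wrapped in mid-word.

```python
import re
from collections import Counter

# A record starts with "MM/DD/YYYY HH:MM:SS;"; anything else is a wrapped tail.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

# Meanings as stated in the quoted message; other codes are left unnamed.
KNOWN = {"15001": "unknown JobID", "15016": "request invalid for job state"}


def join_wrapped(lines):
    """Glue mail-wrapped continuation lines back onto their record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            records.append(line)
        elif records:  # ignore stray text before the first record
            records[-1] += " " + line
    return records


def parse(record):
    """Split one record into its six semicolon-separated fields."""
    stamp, event, daemon, klass, name, msg = record.split(";", 5)
    return {"time": stamp, "event": event, "daemon": daemon.strip(),
            "class": klass, "name": name, "msg": msg}


def error_counts(records):
    """Count 150xx codes that follow 'error', 'error:' or 'code='."""
    counts = Counter()
    for rec in records:
        msg = parse(rec)["msg"]
        for code in re.findall(r"(?:error:? |code=)(150\d\d)", msg):
            counts[code] += 1
    return counts


def summarize(counts):
    """One line per code, with the meaning where one is known."""
    return ["%s x%d (%s)" % (code, n, KNOWN.get(code, "meaning not looked up"))
            for code, n in sorted(counts.items())]
```

Feeding it open("server_log").readlines() (path is whatever your server
log is called) through join_wrapped, error_counts and summarize gives a
quick per-code tally.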

The MOM in question had errors complaining of a bad prologue/epilogue 
(never saw errors though):

09/13/2004 15:49:53;0008;   pbs_mom;Job;12200.pbsserv;JOIN JOB as node 3
09/13/2004 15:49:56;0100;   pbs_mom;Req;;Type queuejob request received from PBS_Server at pbsserv, sock=10
09/13/2004 15:49:56;0100;   pbs_mom;Req;;Type jobscript request received from PBS_Server at pbsserv, sock=10
09/13/2004 15:49:56;0100;   pbs_mom;Req;;Type readytocommit request received from PBS_Server at pbsserv, sock=10
09/13/2004 15:49:56;0100;   pbs_mom;Req;;Type commit request received from PBS_Server at pbsserv, sock=10
09/13/2004 15:49:56;0100;   pbs_mom;Req;;Type statusjob request received from PBS_Server at pbsserv, sock=10
09/13/2004 15:49:56;0100;   pbs_mom;Req;;Type modifyjob request received from PBS_Server at pbsserv, sock=13
09/13/2004 15:49:56;0008;   pbs_mom;Job;12200.pbsserv;Job Modified at request of PBS_Server at pbsserv
09/13/2004 15:49:58;0001;   pbs_mom;Svr;pbs_mom;Unknown error 15010 (15010) in job_start_error from node 172.16.32.16:15003, 12200.pbsserv
09/13/2004 15:49:58;0001;   pbs_mom;Job;12200.pbsserv;pro/epilogue failed, file: /var/opt/torque/mom_priv/epilogue, exit: 1, nonzero p/e exit status
09/13/2004 15:49:58;0080;   pbs_mom;Job;12200.pbsserv;Obit sent
09/13/2004 15:49:58;0100;   pbs_mom;Req;;Type deletejob request received from PBS_Server at pbsserv, sock=11

The curious bit is that it says it's node 3, but IIRC, it was node 1.  
I think this problem has caused pbs_server to hang in the past.

Hope this helps.  Any suggestions as to where to look?  I'm running the 
PBS server on Solaris 9, with a horde of Linux nodes.

Thanks

M


On Jun 8, 2004, at 11:48 AM, Daniel J. Bodony wrote:

> Hello,
>
> Our 56-node cluster, with torque 1.0.1p6 and maui 3.2.6p6 running on
> a dedicated frontend, is having a problem that is occurring with
> increasing frequency.  The MOM <-> Server race condition that others
> on this list have commented on shows up in our system and, short of
> a system wide reboot, cannot be cleared.  Both the frontend and
> compute nodes are running kernels 2.4.20-18.9smp on RH9.0.
>
> The path to this error appears to be the following:
>   1.  User submits job using qsub in the normal way.
>   2.  qstat -u username reports the job as queued
>   3.  qstat -an username reports the job as queued *but with assigned
>       nodes*.
>   4.  repeated qstat's sometimes show a RUN condition, but for brief
>       moments only.
>   5.  job remains queued ``forever'' and never runs
>
> On the frontend, the server_log reports many errors like
>
> 06/08/2004 11:46:26;0100;PBS_Server;Req;;Type modifyjob request received from root at whitehot.Stanford.EDU, sock=9
> 06/08/2004 11:46:26;0008;PBS_Server;Job;1547.whitehot.Stanford.EDU;Job Modified at request of root at whitehot.Stanford.EDU
> 06/08/2004 11:46:26;0100;PBS_Server;Req;;Type movejobfile request received from pbs_mom at n6-4, sock=84
> 06/08/2004 11:46:26;0040;PBS_Server;Svr;whitehot.Stanford.EDU;Scheduler sent command new
> 06/08/2004 11:46:26;0100;PBS_Server;Req;;Type movejobfile request received from pbs_mom at n6-4, sock=87
> 06/08/2004 11:46:26;0009;PBS_Server;Job;1547.whitehot.Stanford.EDU;Job Obit notice received from n6-4 has error 15016
> 06/08/2004 11:46:26;0080;PBS_Server;Req;req_reject;Reject reply code=15016, aux=0, type=56, from pbs_mom at n6-4
> 06/08/2004 11:46:26;0008;PBS_Server;Job;1547.whitehot.Stanford.EDU;MOM rejected modify request, error: 15001
> 06/08/2004 11:46:26;0080;PBS_Server;Req;req_reject;Reject reply code=15001, aux=0, type=11, from root at whitehot.Stanford.EDU
>
> The node 'n6-4' in the above is the first node in the nodelist
> associated with the job in question.  Other jobs with similar problems
> also have the first listed node complaining the loudest.
>
> On the node n6-4, the pbs_mom/mom_logs shows entries of the type
>
> 06/08/2004 11:45:35;0001;   pbs_mom;Svr;pbs_mom;job_start_error, job_start_error: sent 10 ABORT requests, should be 11
> 06/08/2004 11:45:35;0080;   pbs_mom;Job;1547.whitehot.Stanford.EDU;Obit sent
> 06/08/2004 11:45:35;0001;   pbs_mom;Req;obit reply;Job not found for obit reply
> 06/08/2004 11:45:35;0100;   pbs_mom;Req;;Type statusjob request received from PBS_Server at n0, sock=12
> 06/08/2004 11:45:35;0080;   pbs_mom;Job;1547.whitehot.Stanford.EDU;Obit sent
> 06/08/2004 11:45:35;0100;   pbs_mom;Req;;Type modifyjob request received from PBS_Server at n0, sock=10
> 06/08/2004 11:45:35;0008;   pbs_mom;Job;1547.whitehot.Stanford.EDU;Job Modified at request of PBS_Server at n0
> 06/08/2004 11:45:35;0001;   pbs_mom;Job;1547.whitehot.Stanford.EDU;server rejected job obit - unexpected job state
> 06/08/2004 11:45:35;0100;   pbs_mom;Req;;Type deletejob request received from PBS_Server at n0, sock=14
> 06/08/2004 11:45:35;0080;   pbs_mom;Req;req_reject;Reject reply code=15001, aux=0, type=6, from PBS_Server at n0
>
> The errors 15001 (unknown JobID) and 15016 (request invalid for job
> state) show up regularly.  Restarting pbs_mom on the nodes does not
> fix the problem; only a cluster-wide reboot does.
>
> Is there any help for this?  I noticed that the latest torque release
> (1.1.0p0) does not address this according to the CHANGELOG.
>
> Thanks,
>
> Daniel
> Stanford University
>


