[torqueusers] Serious torque failure problems

Garrick Staples garrick at usc.edu
Mon Aug 15 13:07:11 MDT 2005


Which exact version of TORQUE are you using?  Is it possible that you have a
version mismatch between pbs_server and pbs_mom?  Which scheduler are you
using?

Do you have prologue scripts?  How long do they take to run?

Can you increase the loglevel on the MOMs?  Just send pbs_mom a few USR1 signals.
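
As a rough sketch of the "few USR1 signals" suggestion, the helper below sends SIGUSR1 to a process several times. The function name, the `steps` and `delay` parameters, and the injectable `send` argument are all made up for illustration; you still need to look up the pbs_mom PID yourself (for example with `pgrep pbs_mom`). The short pause is there because standard POSIX signals are not queued, so signals sent back to back may be merged into one.

```python
import os
import signal
import time


def raise_mom_loglevel(pid, steps=3, delay=0.5, send=os.kill):
    """Send SIGUSR1 to `pid` `steps` times, pausing between signals.

    `send` defaults to os.kill; it is a parameter only so the helper
    can be dry-run without signalling a real process (hypothetical design).
    """
    for i in range(steps):
        send(pid, signal.SIGUSR1)
        if i < steps - 1:
            # Standard signals are not queued, so leave a gap between them.
            time.sleep(delay)
    return steps
```

Something like `raise_mom_loglevel(mom_pid)` would be run as root on each compute node, with `mom_pid` being whatever PID you found for pbs_mom on that node.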


> 21:54:25;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Modified at request of Scheduler at seychelles.nmr.mgh.harvard.edu
> 08/12/2005 21:54:25;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Run at request of Scheduler at seychelles.nmr.mgh.harvard.edu
> 08/12/2005 21:54:25;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Modified at request of Scheduler at seychelles.nmr.mgh.harvard.edu
> 08/12/2005 21:55:13;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Modified at request of Scheduler at seychelles.nmr.mgh.harvard.edu
> 08/12/2005 21:55:13;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Run at request of Scheduler at seychelles.nmr.mgh.harvard.edu

At this point I'm already worried: there should be only one "Job Run" request.
Did the scheduler just run the job on 2 different nodes in less than a minute?
Can you see what the scheduler's logs show at this point?
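
To check whether other jobs show the same pattern, a quick sketch like the following counts "Job Run" records per job ID in the server log. It assumes each record sits on one line in the semicolon-delimited layout seen in the quoted log (timestamp, event code, daemon, object type, object ID, message), not wrapped the way the mail client wrapped it. The function names are hypothetical. Note that a job that was legitimately requeued and rerun would also show up here, so treat a hit as something to investigate, not as proof of a fault.

```python
from collections import Counter


def count_job_runs(lines):
    """Count 'Job Run' records per job ID in server log lines."""
    runs = Counter()
    for line in lines:
        # Split into at most six fields so semicolons in the message survive.
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # not a complete log record
        _stamp, _event, _daemon, obj_type, job_id, message = fields
        if obj_type == "Job" and message.startswith("Job Run"):
            runs[job_id] += 1
    return runs


def jobs_run_more_than_once(lines):
    """Return {job_id: count} for jobs with more than one 'Job Run' record."""
    return {job: n for job, n in count_job_runs(lines).items() if n > 1}
```

You could feed it an open server log file directly, e.g. `jobs_run_more_than_once(open(path))`, where `path` is wherever your pbs_server keeps its daily log.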


> 08/12/2005 21:55:20;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;unable to run job, MOM rejected

If that MOM actually ran the job, it should not have rejected it.  At this
point the damage has been done and we're all screwed up.  I'd love to see more
detailed MOM logs between the "Job Run" and the "MOM rejected".
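
For pulling out just that window, here is a minimal sketch that keeps the log lines whose timestamp falls between two given times. It assumes the MOM log uses the same "MM/DD/YYYY HH:MM:SS;..." prefix as the server log quoted above, which I have not verified against your files; the function name is hypothetical.

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted server log.
STAMP_FORMAT = "%m/%d/%Y %H:%M:%S"


def records_between(lines, start, end):
    """Return log lines whose leading timestamp lies in [start, end]."""
    lo = datetime.strptime(start, STAMP_FORMAT)
    hi = datetime.strptime(end, STAMP_FORMAT)
    selected = []
    for line in lines:
        stamp = line.split(";", 1)[0]
        try:
            when = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if lo <= when <= hi:
            selected.append(line)
    return selected
```

For the case above, the window would be `records_between(mom_log_lines, "08/12/2005 21:55:13", "08/12/2005 21:55:20")`, run against the log from the node that rejected the job.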

That said, I can easily see this being a protocol mismatch if pbs_mom is newer
than pbs_server.


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California