[torqueusers] Serious torque failure problems

Paul Raines raines at nmr.mgh.harvard.edu
Fri Aug 12 21:28:31 MDT 2005


Here is an example of a job that ended up being run on three different
nodes at the same time.  The pbs_server log has:


[root at seychelles reza]# grep Job.69483 /var/spool/PBS/server_logs/20050812
08/12/2005 21:12:13;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Queued at request of reza at seychelles.nmr.mgh.harvard.edu, owner = reza at seychelles.nmr.mgh.harvard.edu, job name = pbsjob_272, queue = corporal
08/12/2005 21:51:13;0086;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Requeueing job, substate: 10 Requeued in queue: corporal
08/12/2005 21:54:25;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Modified at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:54:25;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Run at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:54:25;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Modified at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:55:13;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Modified at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:55:13;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Run at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:55:20;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;unable to run job, MOM rejected
08/12/2005 21:55:20;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Modified at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:56:00;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Modified at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:56:00;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Run at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:56:06;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;unable to run job, MOM rejected
08/12/2005 21:56:06;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Modified at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:56:50;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Modified at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:56:50;0008;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Job Run at request of Scheduler at seychelles.nmr.mgh.harvard.edu
08/12/2005 21:56:52;0010;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Exit_status=-1
08/12/2005 22:11:19;0009;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Reject reply code=15001(Unknown Job Id), aux=0, type=JobObituary, from pbs_mom at node0374.seychelles.nmr.mgh.harvard.edu
08/12/2005 22:56:15;0009;PBS_Server;Job;69483.seychelles.nmr.mgh.harvard.edu;Reject reply code=15001(Unknown Job Id), aux=0, type=JobObituary, from pbs_mom at node0370.seychelles.nmr.mgh.harvard.edu

Essentially, the two nodes for which the server logged "MOM rejected"
actually ran the job.  When they later finished and reported back, the
server rejected their obituaries with these "Unknown Job Id" errors.
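
For anyone wanting to check whether other jobs were hit the same way, here
is a rough Python sketch that scans a pbs_server log for the pattern
described above: a job with at least one "MOM rejected" entry that later
gets a rejected JobObituary.  It assumes the semicolon-separated layout
seen in the excerpt; the function names are made up for illustration and
are not part of TORQUE.

```python
import re
from collections import defaultdict

# Matches the sender of a rejected obituary. The archive shows "pbs_mom at host";
# an unobfuscated log would presumably show "pbs_mom@host", so accept both.
OBIT_SENDER = re.compile(r"from pbs_mom(?:@| at )(\S+)")


def summarize_server_log(lines):
    """Count run requests, MOM rejections and rejected obituaries per job."""
    stats = defaultdict(
        lambda: {"runs": 0, "mom_rejected": 0, "obit_rejected_from": []}
    )
    for line in lines:
        # Assumed layout: timestamp;code;daemon;type;job id;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6 or fields[3] != "Job":
            continue
        job_id, message = fields[4], fields[5]
        entry = stats[job_id]
        if message.startswith("Job Run at request of"):
            entry["runs"] += 1
        elif message.startswith("unable to run job, MOM rejected"):
            entry["mom_rejected"] += 1
        elif "type=JobObituary" in message and "Unknown Job Id" in message:
            match = OBIT_SENDER.search(message)
            if match:
                entry["obit_rejected_from"].append(match.group(1))
    return dict(stats)


def suspect_jobs(stats):
    """Jobs that were 'MOM rejected' yet later sent an obituary the server refused."""
    return sorted(
        job
        for job, s in stats.items()
        if s["mom_rejected"] and s["obit_rejected_from"]
    )
```

Typical use would be something like
suspect_jobs(summarize_server_log(open(path))) with path pointing at one
day's server log.  A hit only says the log shows the same signature; it
does not prove the job really ran more than once.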

Of the two nodes in the final two lines above, node0370 has this in its log:

08/12/2005 21:55:30;0001;   pbs_mom;Job;TMomFinalizeJob3;job 69483.seychelles.nmr.mgh.harvard.edu started, pid = 22128
08/12/2005 22:56:15;0080;   pbs_mom;Job;69483.seychelles.nmr.mgh.harvard.edu;scan_for_terminated: job 69483.seychelles.nmr.mgh.harvard.edu task 1 terminated, sid 22128
08/12/2005 22:56:15;0008;   pbs_mom;Job;69483.seychelles.nmr.mgh.harvard.edu;Terminated
08/12/2005 22:56:15;0001;   pbs_mom;Job;69483.seychelles.nmr.mgh.harvard.edu;server rejected job obit - 15001

On the final node it ran on, at 21:56, there is only the one line:

08/12/2005 21:56:51;0001;   pbs_mom;Job;TMomFinalizeJob3;job 69483.seychelles.nmr.mgh.harvard.edu started, pid = 22118
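
To compare nodes, a similar sketch can pair each "started" entry in a
pbs_mom log with the matching "Terminated" entry, so that jobs with a
start but no termination record (as on this last node) stand out.  Again
this assumes the log layout shown in the excerpts above, and
job_lifetimes is a hypothetical helper name, not a TORQUE tool.

```python
import re
from datetime import datetime

START_RE = re.compile(r"job (\S+) started, pid = (\d+)")
STAMP = "%m/%d/%Y %H:%M:%S"


def job_lifetimes(lines):
    """Map job id -> (start time, pid, end time or None) from pbs_mom log lines."""
    started, finished = {}, {}
    for line in lines:
        # Assumed layout: timestamp;code;daemon;type;id;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue
        try:
            when = datetime.strptime(fields[0], STAMP)
        except ValueError:
            continue  # skip lines that do not begin with a timestamp
        match = START_RE.search(fields[5])
        if match:
            started[match.group(1)] = (when, int(match.group(2)))
        elif fields[5].strip() == "Terminated":
            # For this entry the job id is in the fifth field
            finished[fields[4]] = when
    return {
        job: (start, pid, finished.get(job))
        for job, (start, pid) in started.items()
    }
```

An end time of None means the log has no "Terminated" line for that job,
which is what the single line above would produce.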


