[torqueusers] Force a job to rerun after mom has crashed

Mahmood Naderan nt_mahmood at yahoo.com
Wed Aug 24 13:22:42 MDT 2011


> For the most part that is correct. What should happen is the MOM on
> restart will tell pbs_server to requeue the job. The server will change
> the state from running to queued and then report that to the scheduler.
> The scheduler will then change the state of the job to queued and rerun
> it when it can. If it is in a running state in showq I would check the
> job in TORQUE using qstat. If qstat says the job is running then TORQUE
> and scheduler are in sync. If not then just wait for the scheduler to
> catch up.
 
ok thanks.


// Naderan *Mahmood;


----- Original Message -----
From: Ken Nielson <knielson at adaptivecomputing.com>
To: Mahmood Naderan <nt_mahmood at yahoo.com>
Cc: Torque Users Mailing List <torqueusers at supercluster.org>
Sent: Wednesday, August 24, 2011 11:43 PM
Subject: Re: [torqueusers] Force a job to rerun after mom has crashed



----- Original Message -----
> From: "Mahmood Naderan" <nt_mahmood at yahoo.com>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Cc: "Ken Nielson" <knielson at adaptivecomputing.com>
> Sent: Wednesday, August 24, 2011 1:04:33 PM
> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
> > But if the failure is because of a system crash you may want to start
> > the mom with the -q option which will requeue jobs so they can be
> > rerun.
> 
> One question regarding your reply....
> 
> Recently one of our nodes (a compute node running pbs_mom) crashed
> with a kernel panic message. While it was up, it was running an MPI
> job. I didn't notice the crash at first because the job was still
> listed in the output of the showq command.
> 
> After a cold restart and starting pbs_mom on the node, I saw no CPU
> usage, which means the job didn't restart. However, the job id was
> still present in the output of the showq command. So I manually used
> qdel to delete that job and submitted the MPI job again.
> 
> What I understand from your statement is this:
> node is running pbs_mom => crashed => restart => pbs_mom -q => you can
> see the job is running again.
> 
> Is my understanding correct?

For the most part that is correct. What should happen is the MOM on restart will tell pbs_server to requeue the job. The server will change the state from running to queued and then report that to the scheduler. The scheduler will then change the state of the job to queued and rerun it when it can. If it is in a running state in showq I would check the job in TORQUE using qstat. If qstat says the job is running then TORQUE and scheduler are in sync. If not then just wait for the scheduler to catch up.

Ken
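
The qstat check described above could be scripted roughly as follows. This is only a minimal sketch, assuming that `qstat -f <jobid>` prints a line of the form `job_state = R`; the function names, the sample text and the job id in it are hypothetical and are not taken from the thread.

```python
import re
import subprocess

# Hypothetical excerpt of `qstat -f` output, used only for illustration.
SAMPLE_QSTAT_OUTPUT = """Job Id: 1234.headnode
    Job_Name = mpi_test
    job_state = Q
"""


def parse_job_state(qstat_text):
    """Return the job_state letter found in `qstat -f` output, or None."""
    match = re.search(r"^\s*job_state\s*=\s*(\S+)\s*$", qstat_text, re.MULTILINE)
    return match.group(1) if match else None


def sync_advice(torque_state):
    """Apply the rule from the message to a job that showq lists as running."""
    if torque_state is None:
        return "could not read job_state from the qstat output"
    if torque_state == "R":
        return "qstat also says running: TORQUE and the scheduler are in sync"
    return "qstat says %s: wait for the scheduler to catch up" % torque_state


def query_job_state(job_id):
    """Run `qstat -f <job_id>` and parse it (needs the TORQUE client tools)."""
    result = subprocess.run(
        ["qstat", "-f", job_id], capture_output=True, text=True, check=True
    )
    return parse_job_state(result.stdout)


if __name__ == "__main__":
    # Uses the canned sample so the sketch runs without a TORQUE installation.
    print(sync_advice(parse_job_state(SAMPLE_QSTAT_OUTPUT)))
```

On a real cluster, `query_job_state` would be called with the id of the job that showq still lists as running, and its result passed to `sync_advice`.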


