[torqueusers] Force a job to rerun after mom has crashed

Ken Nielson knielson at adaptivecomputing.com
Wed Aug 24 13:13:13 MDT 2011



----- Original Message -----
> From: "Mahmood Naderan" <nt_mahmood at yahoo.com>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Cc: "Ken Nielson" <knielson at adaptivecomputing.com>
> Sent: Wednesday, August 24, 2011 1:04:33 PM
> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
> >But if the failure is because of a system crash you may want to start
> >the mom with the -q option which will requeue jobs so they can be
> >rerun.
> 
> One question regarding your reply....
> 
> Recently one of our nodes (a compute node on which pbs_mom is running)
> crashed with a kernel panic message. While it was up, it was running an
> MPI job. Although the node crashed, I didn't notice because the job
> was still in the output of the showq command.
> 
> After a cold restart and starting pbs_mom on the node, I saw no CPU
> usage, which means that the job didn't restart. However, the job id was
> still present in the output of the showq command. So I manually used
> qdel to delete that job and submitted the MPI job again.
> 
> What I understand from your statement is this:
> node is running pbs_mom => crashed => restart => pbs_mom -q => you can
> see the job is running again.
> 
> Is my understanding correct?

For the most part that is correct. What should happen is that the MOM, on restart, tells pbs_server to requeue the job. The server changes the job's state from running to queued and reports that to the scheduler. The scheduler then marks the job as queued on its side and reruns it when it can. If the job still shows as running in showq, I would check it in TORQUE using qstat. If qstat also says the job is running, then TORQUE and the scheduler are in sync. If not, just wait for the scheduler to catch up.
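
As a rough illustration of the qstat check, here is a minimal Python sketch. It assumes that `qstat -f <jobid>` prints a line of the form `job_state = R` (or `Q`, and so on); the function names and the job id in the usage note are made up for the example, and the wording of the messages is only my summary of the above.

```python
import re
import subprocess
import sys


def parse_job_state(qstat_full_output):
    """Return the job_state letter found in `qstat -f` text, or None."""
    match = re.search(r"^\s*job_state\s*=\s*(\S+)", qstat_full_output, re.MULTILINE)
    return match.group(1) if match else None


def torque_job_state(job_id):
    """Run `qstat -f <job_id>` and return its job_state (needs qstat on PATH)."""
    result = subprocess.run(
        ["qstat", "-f", job_id], capture_output=True, text=True, check=True
    )
    return parse_job_state(result.stdout)


def describe_state(state):
    """Turn the state letter into a hint, assuming showq lists the job as running."""
    if state == "R":
        return "TORQUE also reports the job as running: server and scheduler agree."
    if state == "Q":
        return "TORQUE has requeued the job: wait for the scheduler to catch up."
    return "No R or Q state found: inspect the full qstat -f output by hand."


if __name__ == "__main__":
    # Hypothetical usage: python check_job.py 1234.headnode
    print(describe_state(torque_job_state(sys.argv[1])))
```

Run it with the job id that showq still lists as running. It only reads the state from TORQUE and does not requeue or delete anything.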

Ken

