[torqueusers] Force a job to rerun after mom has crashed

Ken Nielson knielson at adaptivecomputing.com
Wed Aug 24 14:48:55 MDT 2011



----- Original Message -----
> From: "David Sheen" <sheen.david at gmail.com>
> To: "Mahmood Naderan" <nt_mahmood at yahoo.com>, "Torque Users Mailing List" <torqueusers at supercluster.org>
> Cc: "Ken Nielson" <knielson at adaptivecomputing.com>
> Sent: Wednesday, August 24, 2011 1:59:17 PM
> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
> What if restarting the MOM isn't an option?

David,

I don't understand your question. Do you mean you can't restart the MOM ever? 

Ken

> 
> On Wed, Aug 24, 2011 at 3:22 PM, Mahmood Naderan
> <nt_mahmood at yahoo.com> wrote:
> > > For the most part that is correct. What should happen is the MOM on
> > > restart will tell pbs_server to requeue the job. The server will
> > > change the state from running to queued and then report that to the
> > > scheduler. The scheduler will then change the state of the job to
> > > queued and rerun it when it can. If it is in a running state in
> > > showq I would check the job in TORQUE using qstat. If qstat says the
> > > job is running then TORQUE and scheduler are in sync. If not then
> > > just wait for the scheduler to catch up.
> >
> > ok thanks.
> >
> >
> > // Naderan *Mahmood;
> >
> >
> > ----- Original Message -----
> > From: Ken Nielson <knielson at adaptivecomputing.com>
> > To: Mahmood Naderan <nt_mahmood at yahoo.com>
> > Cc: Torque Users Mailing List <torqueusers at supercluster.org>
> > Sent: Wednesday, August 24, 2011 11:43 PM
> > Subject: Re: [torqueusers] Force a job to rerun after mom has
> > crashed
> >
> >
> >
> > ----- Original Message -----
> >> From: "Mahmood Naderan" <nt_mahmood at yahoo.com>
> >> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >> Cc: "Ken Nielson" <knielson at adaptivecomputing.com>
> >> Sent: Wednesday, August 24, 2011 1:04:33 PM
> >> Subject: Re: [torqueusers] Force a job to rerun after mom has
> >> crashed
> >> > But if the failure is because of a system crash you may want to
> >> > start the mom with the -q option which will requeue jobs so they
> >> > can be rerun.
> >>
> >> One question regarding your reply....
> >>
> >> Recently one of our compute nodes (on which pbs_mom is running)
> >> crashed with a kernel panic. At the time, it was running an MPI
> >> job. I didn't notice the crash because the job was still listed in
> >> the output of the showq command.
> >>
> >> After a cold restart and starting pbs_mom on the node, I saw no CPU
> >> usage, which means the job didn't restart. However, the job ID was
> >> still present in the output of showq. So I manually used qdel to
> >> delete that job and submitted the MPI job again.
> >>
> >> What I understand from your statement is this:
> >> node is running pbs_mom => crashed => restart => pbs_mom -q => you
> >> can see the job is running again.
> >>
> >> Is my understanding correct?
> >
> > For the most part that is correct. What should happen is the MOM on
> > restart will tell pbs_server to requeue the job. The server will
> > change the state from running to queued and then report that to the
> > scheduler. The scheduler will then change the state of the job to
> > queued and rerun it when it can. If it is in a running state in
> > showq I would check the job in TORQUE using qstat. If qstat says the
> > job is running then TORQUE and scheduler are in sync. If not then
> > just wait for the scheduler to catch up.
> >
> > Ken
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
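
The qstat check Ken describes above could be scripted roughly as
follows. This is only a sketch under stated assumptions: the function
names are made up for illustration, it assumes `qstat -f <jobid>`
prints a line of the form `job_state = R`, and the sample text is
invented rather than taken from a real cluster.

```python
import re
import subprocess


def parse_job_state(qstat_full_output):
    """Return the job_state letter found in `qstat -f` text, or None."""
    # Assumption: the full listing contains a line like "    job_state = R".
    match = re.search(r"^\s*job_state\s*=\s*(\S+)", qstat_full_output,
                      re.MULTILINE)
    return match.group(1) if match else None


def torque_job_state(job_id):
    """Ask TORQUE for the job's state (needs qstat on PATH; hypothetical helper)."""
    result = subprocess.run(["qstat", "-f", job_id],
                            capture_output=True, text=True, check=True)
    return parse_job_state(result.stdout)


def advice_when_showq_says_running(torque_state):
    """Apply the rule of thumb from the thread for a job showq lists as running."""
    if torque_state == "R":
        return "qstat also says running: TORQUE and the scheduler are in sync"
    if torque_state == "Q":
        return "TORQUE has requeued the job: wait for the scheduler to catch up"
    return "unexpected or missing state: inspect the job by hand"


# Invented sample text standing in for real `qstat -f` output.
SAMPLE = """Job Id: 123.headnode
    Job_Name = mpi_test
    job_state = Q
"""

if __name__ == "__main__":
    print(advice_when_showq_says_running(parse_job_state(SAMPLE)))
```

The parsing is kept separate from the subprocess call so it can be
tried on saved output without a running pbs_server.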

