[torqueusers] Force a job to rerun after mom has crashed

David Sheen sheen at usc.edu
Wed Aug 24 14:53:25 MDT 2011


Ken,

The node has been taken offline by the administrator for testing.

David

On Wed, Aug 24, 2011 at 4:48 PM, Ken Nielson
<knielson at adaptivecomputing.com> wrote:
>
>
> ----- Original Message -----
>> From: "David Sheen" <sheen.david at gmail.com>
>> To: "Mahmood Naderan" <nt_mahmood at yahoo.com>, "Torque Users Mailing List" <torqueusers at supercluster.org>
>> Cc: "Ken Nielson" <knielson at adaptivecomputing.com>
>> Sent: Wednesday, August 24, 2011 1:59:17 PM
>> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
>> What if restarting the MOM isn't an option?
>
> David,
>
> I don't understand your question. Do you mean you can't restart the MOM ever?
>
> Ken
>
>>
>> On Wed, Aug 24, 2011 at 3:22 PM, Mahmood Naderan
>> <nt_mahmood at yahoo.com> wrote:
>> > > For the most part that is correct. What should happen is the MOM on
>> > > restart will tell pbs_server to requeue the job. The server will
>> > > change the state from running to queued and then report that to the
>> > > scheduler. The scheduler will then change the state of the job to
>> > > queued and rerun it when it can. If it is in a running state in
>> > > showq I would check the job in TORQUE using qstat. If qstat says the
>> > > job is running then TORQUE and scheduler are in sync. If not then
>> > > just wait for the scheduler to catch up.
>> >
>> > ok thanks.
>> >
>> >
>> > // Naderan *Mahmood;
>> >
>> >
>> > ----- Original Message -----
>> > From: Ken Nielson <knielson at adaptivecomputing.com>
>> > To: Mahmood Naderan <nt_mahmood at yahoo.com>
>> > Cc: Torque Users Mailing List <torqueusers at supercluster.org>
>> > Sent: Wednesday, August 24, 2011 11:43 PM
>> > Subject: Re: [torqueusers] Force a job to rerun after mom has
>> > crashed
>> >
>> >
>> >
>> > ----- Original Message -----
>> >> From: "Mahmood Naderan" <nt_mahmood at yahoo.com>
>> >> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>> >> Cc: "Ken Nielson" <knielson at adaptivecomputing.com>
>> >> Sent: Wednesday, August 24, 2011 1:04:33 PM
>> >> Subject: Re: [torqueusers] Force a job to rerun after mom has
>> >> crashed
>> >> > But if the failure is because of a system crash you may want to
>> >> > start the mom with the -q option which will requeue jobs so they
>> >> > can be rerun.
>> >>
>> >> One question regarding your reply....
>> >>
>> >> Recently one of our nodes (a compute node on which pbs_mom runs)
>> >> crashed with a kernel panic. While it was up, it was running an
>> >> MPI job. I didn't notice that the node had crashed, because the
>> >> job was still listed in the output of the showq command.
>> >>
>> >> After a cold restart and starting pbs_mom on the node, I saw no CPU
>> >> usage, which means the job didn't restart. However, the job ID was
>> >> still present in the output of showq. So I manually used qdel to
>> >> delete that job and submitted the MPI job again.
>> >>
>> >> What I understand from your statement is this:
>> >> node is running pbs_mom => crashed => restart => pbs_mom -q =>
>> >> you can see the job is running again.
>> >>
>> >> Is my understanding correct?
>> >
>> > For the most part that is correct. What should happen is the MOM on
>> > restart will tell pbs_server to requeue the job. The server will
>> > change the state from running to queued and then report that to the
>> > scheduler. The scheduler will then change the state of the job to
>> > queued and rerun it when it can. If it is in a running state in
>> > showq I would check the job in TORQUE using qstat. If qstat says the
>> > job is running then TORQUE and scheduler are in sync. If not then
>> > just wait for the scheduler to catch up.
>> >
>> > Ken
>> >
>> > _______________________________________________
>> > torqueusers mailing list
>> > torqueusers at supercluster.org
>> > http://www.supercluster.org/mailman/listinfo/torqueusers
>> >
>
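
For reference, the check described in the quoted reply (compare what showq
shows with what qstat reports) could be sketched roughly as below. This is
only an illustration under assumptions: job_state and advise are
hypothetical helper names, and the script assumes that `qstat -f` prints a
line of the form "job_state = R" for the job.

```shell
#!/bin/sh
# Sketch only: read the state TORQUE reports for a job and map it to the
# advice given in the quoted reply. Helper names are hypothetical.

# Print the job_state letter (e.g. R or Q) found in `qstat -f` output on stdin.
job_state() {
    awk -F' = ' '/^[ \t]*job_state = /{print $2; exit}'
}

# Map a state letter to the advice from the thread, for a job that
# showq still lists as running.
advise() {
    case "$1" in
        R) echo "qstat says running: TORQUE and scheduler are in sync" ;;
        Q) echo "qstat says queued: wait for the scheduler to catch up" ;;
        *) echo "qstat says '$1': inspect the job with qstat -f" ;;
    esac
}

# Usage on a real system (assumes JOBID is set; not run here):
#   advise "$(qstat -f "$JOBID" | job_state)"
```

Keeping the parsing separate from the qstat call makes it possible to try the
helpers on saved qstat output without touching the cluster.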

